In [1]:
import pyarrow.parquet as pq
import pandas as pd
import pyarrow as pa

## Create a Parquet file

How can we create a Parquet file in Python?

Let's start from a pandas DataFrame.

In [2]:
df = pd.DataFrame(
    {
        'one': [-1, 0, 2.5],
        'two': ['foo', 'bar', 'baz'],
        'three': [True, False, True]
    },
    index=list('abc')
)

To write it out, we then rely on the Apache Arrow _specification_:


> Apache Arrow was born from the need for a **set of standards** around tabular data representation and interchange between systems. The adoption of these standards reduces computing costs of data serialization/deserialization and implementation costs across systems implemented in different programming languages.

In Python, we can use PyArrow, the Python implementation of the Arrow specification.

In [4]:

# pyarrow.Table object
# The PyArrow Table type is not part of the Apache Arrow specification, but is
# rather a tool to help with wrangling multiple record batches and array pieces
# as a single logical dataset. As a relevant example, we may receive multiple
# small record batches in a socket stream, then need to concatenate them into
# contiguous memory for use in NumPy or pandas. The Table object makes this
# efficient without requiring additional memory copying.
table = pa.Table.from_pandas(df)

# https://arrow.apache.org/docs/python/generated/pyarrow.parquet.write_table.html#pyarrow.parquet.write_table
pq.write_table(table, 'example.parquet')

Next, let's read the metadata of the Parquet file we just wrote.

In [9]:
pq.read_metadata('example.parquet')

<pyarrow._parquet.FileMetaData object at 0x12f8c4220>
  created_by: parquet-cpp-arrow version 18.0.0
  num_columns: 4
  num_rows: 3
  num_row_groups: 1
  format_version: 2.6
  serialized_size: 2572

What is metadata? See https://parquet.apache.org/docs/file-format/metadata/

We have just read the `FileMetaData`:

> The file metadata is described by the `FileMetaData` structure. This file metadata provides offset and size information useful when navigating the Parquet file. 

![Parquet Metadata](docs/FileFormat.gif)

Let's take a larger file

In [7]:
df_large = pd.read_csv('./geographic-units-by-industry-and-statistical-area-2000-2024-descending-order/geographic-units-by-industry-and-statistical-area-2000-2024-descending-order-february-2024.csv')
print(df_large.head())
print(df_large.shape)

  anzsic06     Area  year  geo_count  ec_count
0        A  A100100  2024         87       200
1        A  A100200  2024        135       210
2        A  A100301  2024          6        35
3        A  A100400  2024         54        35
4        A  A100500  2024         51        95
(6751326, 5)


Suppose that about 6.7 million records is too large for a single file. We can partition the data, for example by year.

In [8]:
distinct_years = df_large['year'].unique()
print(distinct_years)

[2024 2023 2022 2021 2020 2019 2018 2017 2016 2015 2014 2013 2012 2011
 2010 2009 2008 2007 2006 2005 2004 2003 2002 2001 2000]


In [None]:
# Write the whole DataFrame to a single (unpartitioned) Parquet file
pq.write_table(
    pa.Table.from_pandas(df_large),
    'example_large.parquet'
)