# Getting started with Parquet

This series introduces Apache Parquet:

* What is Parquet?
* Developing with Parquet in Python
* Writing Parquet files
* Inspecting Parquet files
* Querying Parquet files
* Summary

After going through the notebook, you should have a basic understanding of what Apache Parquet is and how to read and write it from Python. You are encouraged to follow the code line by line and to try to understand what each line does.

## What is Parquet?

Parquet is

* An [open source file format](https://parquet.apache.org/) for column-oriented storage and bulk transfer of data under Apache governance.
* Readable and writable from [all mainstream programming languages and many database systems](https://parquet.apache.org/docs/file-format/implementationstatus/).
* Based on an innovative algorithm for "shredding" arbitrarily nested data structures into flat columns and reassembling them again, [originally from Google Research](https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36632.pdf).

Column-oriented storage is highly efficient for analytics. Consider this IoT example:

* Every second, a machine sends readings from 100 different sensors.
* You want to analyze the historical values of just **one** of those sensors.
* **Row-based formats** (imagine CSV) store each time-stamped reading as a complete row with all 100 sensor values. To analyze one sensor, the system must read all 100 columns.
* **Column-based formats** (like Parquet) group all values for a single sensor into a column. To analyze one sensor, the system only reads that specific column, dramatically reducing I/O.
* This also improves compression, as storing similar data together is more efficient.
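
The difference between the two layouts can be sketched in plain Python. This is only an illustration with made-up readings and three sensors instead of 100; the names `rows`, `row_layout`, and `column_layout` are assumptions for this sketch and say nothing about how Parquet stores bytes internally.

```python
# Three readings, each with a timestamp and three sensor values (illustrative data).
rows = [
    {"ts": 1, "s1": 20.1, "s2": 0.5, "s3": 101},
    {"ts": 2, "s1": 20.3, "s2": 0.4, "s3": 102},
    {"ts": 3, "s1": 20.2, "s2": 0.6, "s3": 100},
]

# Row-oriented: one complete record after the other, as in CSV.
row_layout = [value for row in rows for value in row.values()]

# Column-oriented: all values of one field are stored together.
column_layout = {name: [row[name] for row in rows] for name in rows[0]}

# Analyzing sensor s1 in the row layout means stepping over every stored value ...
values_touched_rows = len(row_layout)

# ... while the column layout only needs the s1 column.
s1_values = column_layout["s1"]
values_touched_columns = len(s1_values)

print(values_touched_rows, values_touched_columns)  # 12 3
```

With 100 sensors instead of three, the gap between the two counts grows accordingly.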

![Row- vs column-oriented](../images/row-vs-column.png)

> **This makes Parquet a great match for IoT analytics.**

## Developing with Parquet in Python

The notebooks focus on usage from Python. What do you need for developing with Parquet in Python?

* PyArrow: A library for reading and writing Parquet files from Python.
* DuckDB: An embeddable query engine for querying Parquet files using SQL.
* Pandas and a graphing library: For visualizing data.

This repository also includes some helper functions for working with Parquet content. Make sure to install the libraries into your Python environment. For example, in Visual Studio Code with Python support enabled, you can select "Python: Create environment" and mark "requirements.txt" to install the libraries. Alternatively, run

```
pip install -r requirements.txt
```

In [None]:
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import duckdb
from pathlib import Path
from IPython.display import display, HTML

print(f"Pandas version: {pd.__version__}")
print(f"PyArrow version: {pa.__version__}")
print(f"DuckDB version: {duckdb.__version__}")

# Display data frames as wider tables.
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 160)

%reload_ext autoreload
%autoreload 2
from helpers import read_jsonl, inspect, compare_sizes

## Writing Parquet files

What is the simplest way to write a Parquet file?

* Use PyArrow to create a table in memory.
* Write the table to a Parquet file.

Let's write a table with just a single row having a single column.

In [None]:
tiny_table = pa.Table.from_pylist([{'value': 42}])
minimal_parquet_file = Path('../data/output/minimal.parquet')
pq.write_table(tiny_table, minimal_parquet_file)

## Inspecting Parquet files

What is actually inside a Parquet file? 

* Data is horizontally partitioned into larger **row groups** (e.g., several MBs).
* Within each row group, data is stored column-by-column in **column chunks**.
* Each column chunk is divided into **data pages**, which is where data is encoded and compressed.
* A **file footer** contains all metadata, including the schema and statistics (like min/max values) for each column chunk. This allows query engines to skip reading unnecessary data.

```
Parquet file
└─ Magic number
└─ Row group (horizontal partitioning)
   └─ Column chunk (vertical - one per column)
      └─ Data page (optional dictionary page followed by actual data pages)
         ├─ Page header
         └─ Actual compressed/encoded data
└─ File footer
   └─ Metadata
   └─ Footer length
   └─ Magic number
```
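
The framing at the end of the file can be checked with a few lines of standard-library Python. This sketch assumes the layout shown above: a 4-byte little-endian footer length directly before the trailing magic number `PAR1`. The function name `footer_length` and the byte string `fake_file` are hypothetical; `fake_file` only imitates the framing and is not a valid Parquet file.

```python
import struct

MAGIC = b"PAR1"


def footer_length(data: bytes) -> int:
    """Return the metadata length stored in the last 8 bytes of the file."""
    if len(data) < 12 or data[:4] != MAGIC or data[-4:] != MAGIC:
        raise ValueError("not a Parquet file: magic number missing")
    # 4-byte little-endian unsigned integer, right before the trailing magic number.
    (length,) = struct.unpack("<I", data[-8:-4])
    return length


# Synthetic stand-in with the same framing (NOT a valid Parquet file).
fake_metadata = b"x" * 10
fake_file = (
    MAGIC
    + b"row-group-bytes"
    + fake_metadata
    + struct.pack("<I", len(fake_metadata))
    + MAGIC
)

print(footer_length(fake_file))  # 10
```

For a real file you could pass `Path(...).read_bytes()`. A real reader would instead seek to the end of the file and fetch only those last bytes.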

![Parquet file layout](../images/FileLayout.gif)

Parquet's read performance is enabled by a "footer-first" approach. Here's how a query engine efficiently finds data without scanning the whole file:

* **Read the Footer:** The engine reads the small footer at the end of the file to get a "map" of its contents. 
* **Use Statistics to Skip Data:** It consults the metadata and statistics (like min/max values) in the footer to determine which row groups and column chunks it can safely ignore. This is called **predicate pushdown**.
* **Fetch Only What's Needed:** The engine seeks directly to the required column chunks and reads only that data, dramatically reducing I/O.
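
The skipping step can be sketched without any Parquet library. The class `RowGroupStats`, the function `row_groups_to_read`, and the statistics values are all made up for illustration; a real engine reads these numbers from the footer and handles many more predicate types.

```python
from dataclasses import dataclass


@dataclass
class RowGroupStats:
    """Min/max statistics of one column chunk, as a footer would record them."""
    index: int
    min_value: float
    max_value: float


def row_groups_to_read(stats, lower_bound):
    """Keep only row groups that may contain values greater than lower_bound."""
    return [s.index for s in stats if s.max_value > lower_bound]


# Hypothetical statistics for a temperature column in three row groups.
footer_stats = [
    RowGroupStats(0, 18.0, 21.5),
    RowGroupStats(1, 19.0, 24.9),
    RowGroupStats(2, 25.0, 31.2),
]

# Query: WHERE temperature > 24.0
print(row_groups_to_read(footer_stats, 24.0))  # [1, 2]
```

Row group 0 is never fetched because its maximum is below the threshold. Row group 1 still has to be read and filtered row by row, since statistics can only rule row groups out, not confirm matches.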

Let's inspect the file just written.

In [None]:
minimal_parquet = pq.ParquetFile(minimal_parquet_file)
inspect(minimal_parquet_file)

That is a lot of metadata for just one value -- much more than what you get from CSV or even JSON.

Notes:

* PyArrow tries to infer the schema automatically from the source data.
* The metadata comes primarily from the Parquet specification, but PyArrow adds properties of its own, such as a serialized copy of the Arrow schema. Use `store_schema=False` to omit that copy if you intend to write many small files.
* Not all of the structure is visible at the Python API level (e.g., data page headers, dictionary content).

Try also looking at the file with [parqeye](https://github.com/kaushiksrini/parqeye):

```
parqeye minimal.parquet
```

Let's repeat this with a larger data set.

In [None]:
events_jsonl_path = Path('../data/input/events.jsonl')
events_data = read_jsonl(events_jsonl_path)

from random import choice
print("Random record from the data:")
display(choice(events_data))

print("Creating Parquet file with pyarrow schema discovery...")
events_table = pa.Table.from_pylist(events_data)
events_parquet_path = Path('../data/output/events.parquet')
pq.write_table(events_table, events_parquet_path)

parquet_mb = events_parquet_path.stat().st_size / (1024**2)
percent_of_original = 100 * events_parquet_path.stat().st_size / events_jsonl_path.stat().st_size
print(f"Parquet file created, size {parquet_mb:.2f} MB, {percent_of_original:.2f}% of original JSONL size.")

In [None]:
events_parquet = pq.ParquetFile(events_parquet_path)
inspect(events_parquet_path)

display(HTML("<h2>Sample data</h2>"))
display(events_table.to_pandas().head())

What can we see?

* There are two row groups for the data. PyArrow automatically splits the data into row groups of at most 1024^2 (about one million) rows.
* PyArrow infers the types automatically, but for JSON input it mostly infers strings.
* By default, it uses Snappy compression, resulting in a file that is slightly larger than if you simply gzipped the JSONL file.
* By default, it writes a dictionary page before the actual data pages.
  * It collects all unique values of the column and writes them into a dictionary table.
  * Instead of the values themselves, the data page contains each value's index in the dictionary table.
  * E.g., if the column has the values "abc", "def", "abc", "def", the dictionary will have the entries "abc" and "def", and the data page itself will have the values 0, 1, 0, 1 (conceptually; full details are [here](https://parquet.apache.org/docs/file-format/data-pages/encodings/)).
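
The dictionary idea from the last bullet can be sketched in a few lines of plain Python. This is a conceptual illustration only: the function names are hypothetical, and real Parquet writers additionally pack the indices with a run-length/bit-packing hybrid encoding.

```python
def dictionary_encode(values):
    """Return (dictionary, indices), with entries in order of first appearance."""
    dictionary = []
    positions = {}
    indices = []
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return dictionary, indices


def dictionary_decode(dictionary, indices):
    """Rebuild the original column from the dictionary and the indices."""
    return [dictionary[i] for i in indices]


column = ["abc", "def", "abc", "def"]
dictionary, indices = dictionary_encode(column)

print(dictionary)  # ['abc', 'def']
print(indices)     # [0, 1, 0, 1]
print(dictionary_decode(dictionary, indices) == column)  # True
```

The fewer distinct values a column has, the more this pays off, because each long string is stored once and every occurrence shrinks to a small integer.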

How would you model and encode the data? What are better options?

## Querying Parquet files

DuckDB is an embedded analytical database that can query Parquet files directly. Here are a few examples.

In [None]:
con = duckdb.connect()

query = f"""
SELECT COUNT(*) as total_events
FROM '{events_parquet_path}'
"""

result = con.execute(query).fetchdf()
display(result)

In [None]:
query = f"""
SELECT
    type,
    COUNT(*) as count,
    ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER(), 2) as percentage
FROM '{events_parquet_path}'
GROUP BY type
ORDER BY count DESC
"""

result = con.execute(query).fetchdf()
print("Event type distribution:")
display(result)

In [None]:
query = f"""
SELECT
    source as device_id,
    COUNT(*) as event_count
FROM '{events_parquet_path}'
GROUP BY device_id
ORDER BY event_count DESC
"""

result = con.execute(query).fetchdf()
print("Top event producers:")
display(result)

What else could you calculate using the events? We will show more examples later.

## Summary 

In this section, we have shown how to write and query Parquet files with just a few lines of code. Parquet files use columnar storage, which is a great fit for IoT analytics. They also contain a considerable amount of metadata, which allows a query engine to query the files in place without necessarily reading the entire file. This is important if you have very large files and need to transfer them from an object store, as we will see later.

Parquet files can be read and written not only from Python and DuckDB, but also from numerous other languages and tools. Try, for example,
 
 * Browsing the files using the interactive Parquet viewer [parqeye](https://github.com/kaushiksrini/parqeye).
 * Reading and writing Parquet files using [Go parquet-tools](https://github.com/hangxie/parquet-tools).
 * Writing Parquet files from the [Java reference implementation](https://github.com/apache/parquet-java) if you are a Java person.

In the next section, we dive deeper into Parquet schemas, encodings, and compression.

In [None]:
con.close()
print("✓ Analysis complete!")