# Getting started with Parquet

This series introduces Apache Parquet:

* What is Parquet?
* Developing with Parquet in Python
* Writing Parquet files
* Inspecting Parquet files
* Querying Parquet files
* Review questions
* Challenges
* Summary

After going through the notebook, you should have a basic understanding of what Apache Parquet is and how to read and write it from Python. You are encouraged to follow the code line by line and try to understand what each line does.

## What is Parquet?

Parquet is

* An [Open source file format](https://parquet.apache.org/) for column-oriented storage and bulk transfer of data under Apache governance.
* Readable and writable from [all mainstream programming languages and many database systems](https://parquet.apache.org/docs/file-format/implementationstatus/).
* Based on an innovative algorithm for seamlessly "shredding" arbitrarily complex nested data structures into flat columns and reassembling them, [originally from Google Research](https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36632.pdf).

Column-oriented storage is highly efficient for analytics. Consider this IoT example:

* A machine sends readings from 100 different sensors every second.
* You want to analyze the historical values of just **one** of those sensors.
* **Row-based formats** (imagine CSV) store each time-stamped reading as a complete row with all 100 sensor values. To analyze one sensor, the system must read all 100 columns.
* **Column-based formats** (like Parquet) group all values for a single sensor into a column. To analyze one sensor, the system only reads that specific column, dramatically reducing I/O.
* This also improves compression, as storing similar data together is more efficient.

![Row- vs column-oriented](../images/row-vs-column.png)

> **This makes Parquet a great match for IoT analytics.**
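
The difference between the two layouts can be sketched in a few lines of plain Python. This is only an illustration of the idea, not how CSV or Parquet readers are actually implemented; the sensor names, the number of readings, and the helper functions are made up for this sketch.

```python
# Illustrative only: sensor names, sizes, and helpers are made up for this sketch.
NUM_SENSORS = 100
NUM_READINGS = 60  # one minute of per-second readings

sensor_names = [f"sensor_{i}" for i in range(NUM_SENSORS)]

# Row-oriented: one record per timestamp, holding all 100 sensor values.
rows = [
    {name: float(t + i) for i, name in enumerate(sensor_names)}
    for t in range(NUM_READINGS)
]

# Column-oriented: one list per sensor, holding all of its readings.
columns = {
    name: [float(t + i) for t in range(NUM_READINGS)]
    for i, name in enumerate(sensor_names)
}


def read_sensor_from_rows(rows, sensor):
    """Walk every row; count all values that have to be passed over."""
    touched = 0
    series = []
    for row in rows:
        touched += len(row)  # the whole row is read to get one value
        series.append(row[sensor])
    return series, touched


def read_sensor_from_columns(columns, sensor):
    """Pick the one column that is needed; nothing else is touched."""
    series = columns[sensor]
    return series, len(series)


row_series, row_touched = read_sensor_from_rows(rows, "sensor_7")
col_series, col_touched = read_sensor_from_columns(columns, "sensor_7")

print(row_touched)  # 6000 values read to extract 60
print(col_touched)  # 60 values read to extract 60
```

Both calls return the same series, but the row layout passes over 100 times as many values to get it.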

## Developing with Parquet in Python

The notebooks focus on usage from Python. For developing with Parquet in Python, multiple options are available. In this notebook, we chose [Daft](https://docs.daft.ai/en/stable/), a high-performance data processing library. We also use [PyArrow](https://arrow.apache.org/docs/python/) for a more detailed look into Parquet files plus [Pandas](https://pandas.pydata.org/) and a [graphing library](https://matplotlib.org/stable/users/index) for visualizing data.

This repository also includes some helper functions for working with Parquet content. Please make sure to install the libraries into your Python environment. For example, in Visual Studio Code with Python support enabled, you can select "Python: Create environment" and mark "requirements.txt" to install the libraries. Or use

```
pip install -r requirements.txt
```

In [None]:
import daft
import pyarrow.parquet as pq
from pathlib import Path
from IPython.display import HTML, display

%reload_ext autoreload
%autoreload 2
from helpers import inspect

## Writing Parquet files

What is the simplest way to write a Parquet file?

* Create a DataFrame in memory.
* Write the DataFrame to a Parquet file.

Let's write a table with just a single row having a single column.

In [None]:
tiny_df = daft.from_pydict({'value': [42]})

files = tiny_df.write_parquet('../data/output/minimal', write_mode='overwrite') # Daft writes a unique file to the given directory.
minimal_parquet_file = files.to_pydict()['path'][0]

## Inspecting Parquet files

What is actually inside a Parquet file? 

* Data is horizontally partitioned into larger **row groups** (e.g., several MBs).
* Within each row group, data is stored column-by-column in **column chunks**.
* Each column chunk is divided into **data pages**, which is where data is encoded and compressed.
* A **file footer** contains all metadata, including the schema and statistics (like min/max values) for each column chunk. This allows query engines to skip reading unnecessary data.

```
Parquet file
├─ Magic number
├─ Row group (horizontal partitioning)
│  └─ Column chunk (vertical - one per column)
│     └─ Data page (optional dictionary page followed by actual data pages)
│        ├─ Page header
│        └─ Actual compressed/encoded data
└─ File footer
   ├─ Metadata
   ├─ Footer length
   └─ Magic number
```

![Parquet file layout](../images/FileLayout.gif)
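
The outermost layer of this layout can be checked with the standard library alone. The sketch below assumes an unencrypted file: it verifies the `PAR1` magic number at both ends and reads the 4-byte little-endian footer length that sits just before the trailing magic number. The function name `footer_location` and the synthetic bytes are made up for illustration; the metadata itself is Thrift-encoded and is not decoded here.

```python
import struct

MAGIC = b"PAR1"


def footer_location(file_bytes: bytes) -> tuple[int, int]:
    """Return (offset, length) of the footer metadata within the file bytes."""
    if len(file_bytes) < 12 or file_bytes[:4] != MAGIC or file_bytes[-4:] != MAGIC:
        raise ValueError("not an (unencrypted) Parquet file")
    # The 4 bytes before the trailing magic number hold the metadata length
    # as a little-endian 32-bit integer.
    (metadata_length,) = struct.unpack("<I", file_bytes[-8:-4])
    metadata_offset = len(file_bytes) - 8 - metadata_length
    return metadata_offset, metadata_length


# Synthetic stand-in for a real file: the "metadata" is just placeholder bytes.
fake_metadata = b"not real Thrift"
fake_file = (
    MAGIC
    + b"row group bytes"
    + fake_metadata
    + struct.pack("<I", len(fake_metadata))
    + MAGIC
)

offset, length = footer_location(fake_file)
print(fake_file[offset:offset + length])  # b'not real Thrift'
```

The same function should work on the bytes of a real Parquet file, for example `Path(minimal_parquet_file).read_bytes()`, assuming the path returned by Daft points to a local file.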

Parquet's read performance is enabled by a "footer-first" approach. Here's how a query engine efficiently finds data without scanning the whole file:

* **Read the Footer:** The engine reads the small footer at the end of the file to get a "map" of its contents. 
* **Use Statistics to Skip Data:** It consults the metadata and statistics (like min/max values) in the footer to determine which row groups and column chunks it can safely ignore. This is called **predicate pushdown**.
* **Fetch Only What's Needed:** The engine seeks directly to the required column chunks and reads only that data, dramatically reducing I/O.

Let's inspect the file just written.

In [None]:
inspect(minimal_parquet_file)

That is a lot of metadata for just one value -- much more than what you get from CSV or even JSON.

Note: Not all of the structure is visible at the Python API level (e.g., data page headers, dictionary content), but try looking at the file with [parqeye](https://github.com/kaushiksrini/parqeye):

```
parqeye minimal.parquet
```

Check "Row Groups" and the "value" column. What do you notice? 

Let's repeat this with a larger data set.

In [None]:
# Read a large file with events
events_jsonl_path = Path('../data/input/events.jsonl')
events_df = daft.read_json(str(events_jsonl_path))

# Print a sample of the data
events_df.sample(size = 1).show()

# Write to Parquet
files = events_df.write_parquet('../data/output/events', write_mode='overwrite')
events_parquet_path = Path(files.to_pydict()['path'][0])

# Compare with the original JSONL file
parquet_mb = events_parquet_path.stat().st_size / (1024**2)
# Parquet size as a percentage of the original JSONL size (smaller is better)
pct_of_original = 100 * events_parquet_path.stat().st_size / events_jsonl_path.stat().st_size
print(f"Parquet file created, size {parquet_mb:.2f} MB, {pct_of_original:.2f}% of original JSONL size.")

inspect(events_parquet_path)

In [None]:
events_df.show(5)

What can we see?

* Daft auto-discovers types. Even though "time" and "creationTime" are just strings in the input, it finds that these are actually timestamps and converts the type accordingly.
* By default, it uses Snappy compression, which compresses and decompresses very quickly but yields a file that is slightly larger than if you simply gzipped the JSONL file.
* By default, it writes dictionaries before the actual data pages.
  * It collects all unique values of the column and writes them into a dictionary table.
  * Instead of the values themselves, the data page contains each value's index in the dictionary table.
  * E.g., if the column has the value "abc", "def", "abc", "def", the dictionary will have the entries "abc" and "def" and the data page itself will have the values 0, 1, 0, 1 (conceptually; full details are [here](https://parquet.apache.org/docs/file-format/data-pages/encodings/)).
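
The dictionary idea can be sketched in plain Python. This is a conceptual illustration only, and the function names are made up: a real Parquet writer additionally packs the indices with a run-length/bit-packing encoding, which is not shown here.

```python
def dictionary_encode(values):
    """Split a column into its unique values and one index per original value."""
    dictionary = []  # unique values, in order of first appearance
    positions = {}   # value -> index into the dictionary
    indices = []
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return dictionary, indices


def dictionary_decode(dictionary, indices):
    """Rebuild the original column from the dictionary and the indices."""
    return [dictionary[i] for i in indices]


column = ["abc", "def", "abc", "def"]
dictionary, indices = dictionary_encode(column)

print(dictionary)  # ['abc', 'def']
print(indices)     # [0, 1, 0, 1]
print(dictionary_decode(dictionary, indices) == column)  # True
```

The saving grows with repetition: a column with millions of rows but only a handful of distinct strings stores each string once plus one small integer per row.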

How would you model and encode the data? Can you think of better options?

## Querying Parquet files

Daft can query Parquet files using expressions or even SQL. Here are a few examples.

In [None]:
daft.sql(f"""
    SELECT COUNT(*) as total_events FROM events_df
""").show()

In [None]:
print("Event type distribution:")
daft.sql(f"""
    SELECT
        type,
        COUNT(*) as count
    FROM events_df
    GROUP BY type
    ORDER BY count DESC
""").show()

In [None]:
print("Top event producers:")
daft.sql(f"""
    SELECT source as device_id, COUNT(*) as event_count
    FROM events_df
    GROUP BY device_id
    ORDER BY event_count DESC
""").show()

## Review questions

**Why is columnar storage better for analytics than row-based storage?**
 - Think about the IoT sensor example - when would row-based storage actually be better?

**What makes Parquet's "footer-first" approach efficient for large files?**
 - How does it avoid reading unnecessary data?

**What is the purpose of row groups in Parquet?**
 - Why not just have one giant row group or many tiny ones?

**How does predicate pushdown work with Parquet metadata?**
 - What statistics are needed to skip reading data?

**Compare Parquet's compression ratio to JSONL.**
 - Is Parquet always smaller? What factors affect compression?

## Challenges

### Inspect different file structures

1. Create three versions of the events data with different characteristics: just a few events, all of the events, and the events replicated multiple times to create millions of rows.
2. Inspect each file and compare:
   - Number of row groups created and number of data pages inside each row group
   - Row group and data page sizes
   - File overhead (metadata size vs. data size)
3. What patterns do you notice as file size increases?

Try to do this in your programming language of choice if it is not Python.

### Query performance analysis

1. Write queries that select:
   - All columns for a few rows
   - One column for all rows
   - Aggregations (COUNT, SUM) on specific columns
2. Measure execution time for each query type
3. Which query patterns benefit most from columnar storage?

### Explore cmdata.jsonl

1. Read `cmdata.jsonl` and write it as Parquet
2. Inspect the resulting file structure
3. Query for devices of a specific type
4. Calculate the compression ratio compared to the original JSONL


## Summary 

In this section, we have shown how to easily write and query Parquet files. Parquet files use columnar storage, which is great for IoT analytics. They also contain a fair amount of metadata, which allows a query engine to query the files in place without necessarily reading the entire file. This is important if you have very large files and need to transfer them from an object store, as we will see later.

Parquet files can not only be manipulated by Daft and PyArrow, but also from numerous other languages and tools. Try, for example, 
 
 * Browsing the files using the interactive Parquet viewer [parqeye](https://github.com/kaushiksrini/parqeye).
 * Reading and writing Parquet files using [Go parquet-tools](https://github.com/hangxie/parquet-tools).
 * Writing Parquet files from the [Java reference implementation](https://github.com/apache/parquet-java) if you are a Java person.

In the next section, we dive deeper into Parquet schemas, encodings, and compression.