# Object stores and Parquet

Object stores such as AWS S3 are the standard storage layer for data lakes. Unlike reading local files, accessing data over the network incurs high latency. Parquet is designed for this environment: it lets query engines download only the chunks of data they need (column pruning and predicate pushdown), which drastically reduces network traffic.

## Setup 

Let's look at how this works in more detail. We simulate an object store using the `moto` AWS mocking library and a custom logger that prints the requests sent over the network.

In [None]:
%reload_ext autoreload
%autoreload 2
import os
import daft
from s3simulator import S3Simulator

os.environ["DAFT_DASHBOARD_ENABLED"] = "0"
os.environ["DAFT_PROGRESS_BAR"] = "0"
BUCKET = "data-lake"

s3_sim = S3Simulator(bucket_name=BUCKET, port=5000)
s3_sim.start()

## Write to the object store

Just like in the previous section, we'll sort and write some measurements with fairly small row groups to demonstrate pruning. This time, we write to the mocked object store. The file itself is the same, so you can still inspect the file from the previous section and compare it with this output.

In [None]:
radiator_df = daft.read_json('../data/input/radiator.jsonl')
radiator_sorted_df = radiator_df.sort([daft.col("source")["value"], daft.col("time")])

io_config = daft.io.IOConfig(s3=daft.io.S3Config(endpoint_url="http://127.0.0.1:5000", anonymous=True))
daft.set_execution_config(parquet_target_row_group_size=4*1024*1024)
_ = radiator_sorted_df.write_parquet(f"s3://{BUCKET}/radiator", io_config=io_config, write_mode="overwrite")

## Query the object store

Next, we'll prepare a query on a particular device that we know is at the end of the sort order (and hence at the end of the file). Note that while the query is being prepared, a request with a negative byte range is issued. A negative byte range indicates an offset from the end of the object. What is at the end of a Parquet file?

In [None]:
df_s3 = daft.read_parquet(f"s3://{BUCKET}/radiator", io_config=io_config)
df_s3 = df_s3.filter(daft.col("source")["value"] == "1822301").select(daft.col("meas_PressureBalance")["current_ForceValue"]["value"].alias("force"))
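
As a hint for the question above: the Parquet format specifies that a file ends with the file metadata (the footer), followed by a 4-byte little-endian footer length and the 4-byte magic `PAR1`. The following is a minimal sketch of how a reader could use suffix ranges to locate the footer. The object is a hypothetical stand-in built by hand, not a real Parquet file, the helper names are made up for illustration, and the sketch does not claim to mirror Daft's actual request sizes.

```python
import struct

PARQUET_MAGIC = b"PAR1"


def suffix_range(data: bytes, n: int) -> bytes:
    """Emulate an HTTP 'Range: bytes=-n' request: return the last n bytes."""
    return data[-n:] if n > 0 else b""


def parse_parquet_tail(tail: bytes) -> int:
    """Return the footer length stored in the last 8 bytes of a Parquet file."""
    if len(tail) < 8:
        raise ValueError("need at least the last 8 bytes of the file")
    if tail[-4:] != PARQUET_MAGIC:
        raise ValueError("missing trailing PAR1 magic")
    # The 4 bytes before the magic hold the footer length (little-endian uint32).
    (footer_len,) = struct.unpack("<I", tail[-8:-4])
    return footer_len


# Hypothetical stand-in object: magic, opaque payload, fake footer, length, magic.
fake_footer = b"\x00" * 123
fake_object = (
    PARQUET_MAGIC
    + b"column chunks..."
    + fake_footer
    + struct.pack("<I", len(fake_footer))
    + PARQUET_MAGIC
)

# First suffix request: the last 8 bytes tell us how long the footer is.
footer_len = parse_parquet_tail(suffix_range(fake_object, 8))
# Second suffix request: footer plus the 8 trailing bytes, then cut off the tail.
footer_bytes = suffix_range(fake_object, footer_len + 8)[:footer_len]
print(footer_len)
```

In a real file, the footer bytes contain the schema and, per row group, the offsets and statistics of every column chunk, which is what a query planner needs before it can decide which ranges to fetch.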

Now, let's actually run the query. Note that every request reads only a part of the file. Which part is this? Inspect the previous section's file, for example with:

```
parquet-tools inspect --row-group 13 --column-chunk 22 ….parquet | python3 -m json.tool | less
```

(Column chunk 2 is `source.value`; column chunk 22 is `meas_PressureBalance.current_ForceValue.value`.)
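
To relate the logged requests to the inspected metadata, it helps to remember that HTTP byte ranges are inclusive on both ends. The sketch below turns a column chunk's offset and size into a `Range` header value. The offsets and sizes are hypothetical placeholders, not values from the actual file, and `range_header` is an illustrative helper rather than part of any library.

```python
def range_header(offset: int, length: int) -> str:
    """Build an HTTP Range header value for `length` bytes starting at `offset`."""
    if length <= 0:
        raise ValueError("length must be positive")
    # HTTP byte ranges include both the first and the last byte position.
    return f"bytes={offset}-{offset + length - 1}"


# Hypothetical (offset, length) pairs of two column chunks in one row group.
chunks = {
    "source.value": (4_096, 1_024),
    "meas_PressureBalance.current_ForceValue.value": (65_536, 2_048),
}

headers = {name: range_header(off, size) for name, (off, size) in chunks.items()}
for name, value in headers.items():
    print(f"{name}: {value}")
```

Substituting the offsets and compressed sizes reported by `parquet-tools` should make it possible to match each logged range to a position in the file.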

In [None]:
df_s3.collect()

The ranges shown here actually reach from the start of column `source.value` to the end of column `meas_PressureBalance.current_ForceValue.value` in each row group. Daft seems to be too daft (sorry for the pun) to infer from the min and max values of `source.value` that only row group 13 needs to be read.
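
For reference, the pruning we were hoping for can be sketched in a few lines: a row group is only worth reading if the wanted value lies between the min and max recorded for the filter column. The statistics below are hypothetical and much shorter than the real file's, and `RowGroupStats` and `row_groups_to_read` are illustrative names, not Daft or Parquet APIs.

```python
from dataclasses import dataclass


@dataclass
class RowGroupStats:
    index: int
    min_value: str
    max_value: str


def row_groups_to_read(stats: list[RowGroupStats], wanted: str) -> list[int]:
    """Keep only row groups whose [min, max] interval can contain `wanted`."""
    return [s.index for s in stats if s.min_value <= wanted <= s.max_value]


# Hypothetical statistics of a file sorted by the filter column.
stats = [
    RowGroupStats(0, "1000001", "1200450"),
    RowGroupStats(1, "1200450", "1650020"),
    RowGroupStats(2, "1650020", "1822301"),
]

print(row_groups_to_read(stats, "1822301"))  # [2]
```

Because the file is sorted by the filter column, the min/max intervals barely overlap, so an equality filter usually matches a single row group. On unsorted data the intervals overlap heavily and the same check prunes very little.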

## Query the object store, second try

However, if we flatten the `source` struct to a top-level column containing just the value, then sort by that value, the pruning actually works as expected.

In [None]:
radiator_flat_df = radiator_df.with_column("source_id", daft.col("source")["value"])
radiator_flat_sorted_df = radiator_flat_df.sort([daft.col("source_id"), daft.col("time")])
_ = radiator_flat_sorted_df.write_parquet(f"s3://{BUCKET}/radiator_flat", io_config=io_config, write_mode="overwrite")

In [None]:
df_flat = daft.read_parquet(f"s3://{BUCKET}/radiator_flat", io_config=io_config)
df_flat = df_flat.filter(daft.col("source_id") == "1822301").select(daft.col("meas_PressureBalance")["current_ForceValue"]["value"].alias("force"))

In [None]:
df_flat.collect()

In [None]:
s3_sim.stop()

## Summary

Parquet is built for querying large files on object stores. However, even though Parquet itself is mature, many query engines built on top of it are recent and under active development. They may not yet support the entire range of Parquet features.

This section concludes the Parquet part. The next part covers the Iceberg table standard built on top of Parquet.