# Object stores and Parquet

Object stores such as AWS S3 are the standard storage layer for data lakes. Unlike reading local files, accessing data over the network incurs high latency on every request. Parquet is designed for this environment: it allows query engines to download only the specific chunks of data they need (column pruning and predicate pushdown), drastically reducing network traffic.

## Setup 

Let's look at how this works in more detail. We simulate an object store using the `moto` AWS mocking library and a custom logger that prints the requests sent over the network.

In [None]:
import os
import threading
import boto3
import daft
from werkzeug.serving import make_server, WSGIRequestHandler
from werkzeug.urls import uri_to_iri
from moto.server import DomainDispatcherApplication, create_backend_app

os.environ["DAFT_DASHBOARD_ENABLED"] = "0"
os.environ["DAFT_PROGRESS_BAR"] = "0"
BUCKET = "data-lake"

def custom_log_request(self, code='-', size='-'):
    # Log only requests that touch our bucket, together with their Range header.
    path = uri_to_iri(self.path)
    if BUCKET in path:
        range_header = self.headers.get('Range', 'No range')
        code = str(code)
        print('📡 %04s [%24s] %s %s %s ' % (self.command, range_header, path, code, size))

WSGIRequestHandler.log_request = custom_log_request

def start_s3_server(port=5000):
    app = DomainDispatcherApplication(create_backend_app)
    #app = RangeHeaderLogger(app)

    server = make_server("127.0.0.1", port, app)
    thread = threading.Thread(target=server.serve_forever)
    thread.daemon = True
    thread.start()
    print(f"S3 Server running on port {port}")
    return server

server = start_s3_server()
s3 = boto3.client("s3", endpoint_url="http://127.0.0.1:5000", aws_access_key_id="fake", aws_secret_access_key="fake", region_name="us-east-1")
_ = s3.create_bucket(Bucket=BUCKET)

## Write to the object store

Just like in the previous section, we'll sort and write some measurements with fairly small row groups to demonstrate pruning. This time, we write to the mocked object store. The file itself is the same, so you can still inspect the file from the previous section and compare it with this output.

In [None]:
radiator_df = daft.read_json('../data/input/radiator.jsonl')
radiator_sorted_df = radiator_df.sort([daft.col("source")["value"], daft.col("time")])

io_config = daft.io.IOConfig(s3=daft.io.S3Config(endpoint_url="http://127.0.0.1:5000", anonymous=True))
daft.set_execution_config(parquet_target_row_group_size=4*1024*1024)
_ = radiator_sorted_df.write_parquet(f"s3://{BUCKET}/radiator", io_config=io_config, write_mode="overwrite")

## Query the object store

Next, we'll prepare a query for a particular device that we know comes last in the sort order (and hence sits at the end of the file). Note that while the query is being prepared, a negative byte range is requested. A negative byte range indicates an offset from the end of the object. What is at the end of a Parquet file?

In [None]:
df_s3 = daft.read_parquet(f"s3://{BUCKET}/radiator", io_config=io_config)
df_s3 = df_s3.filter(daft.col("source")["value"] == "1822301").select(daft.col("meas_PressureBalance")["current_ForceValue"]["value"].alias("force"))
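The negative byte range mentioned above is what HTTP calls a suffix range. As a minimal sketch (the helper `resolve_range` is hypothetical and not part of Daft, boto3, or moto), this is roughly how a single-range `Range` header maps to absolute offsets for an object of known size:

```python
def resolve_range(header: str, size: int) -> tuple[int, int]:
    """Translate a single-range HTTP Range header into inclusive (start, end) offsets."""
    unit, _, spec = header.partition("=")
    if unit.strip() != "bytes" or "," in spec:
        raise ValueError(f"unsupported Range header: {header!r}")
    first, _, last = spec.strip().partition("-")
    if first == "":
        # Suffix range "bytes=-N": the last N bytes of the object.
        length = int(last)
        return max(size - length, 0), size - 1
    start = int(first)
    # Open-ended range "bytes=A-": from offset A to the end of the object.
    end = min(int(last), size - 1) if last else size - 1
    return start, end


# Hypothetical 1000-byte object.
print(resolve_range("bytes=-8", 1000))       # → (992, 999)
print(resolve_range("bytes=100-199", 1000))  # → (100, 199)
```

The point of a suffix range is that the client does not need to know the object size beforehand, which saves a round trip.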

Now, let's actually run the query. Note that every request reads only part of the file. Which part? Inspect the previous section's file, e.g., using:

```
parquet-tools inspect --row-group 13 --column-chunk 22 ….parquet | python3 -m json.tool | less
```

(Column chunk 2 is `source.value`, column chunk 22 is `meas_PressureBalance.current_ForceValue.value`.)

In [None]:
df_s3.collect()

The ranges shown here actually span from the start of column `source.value` to the end of column `meas_PressureBalance.current_ForceValue.value` in each row group. Daft seems to be too daft (sorry for the pun) to infer from the min and max values of `source.value` that only row group 13 needs to be read.
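For reference, the pruning one would hope for boils down to comparing the filter value against each row group's min/max statistics. The sketch below is not how Daft is implemented. It uses made-up statistics: only the device ID `1822301` comes from the query above, while the `RowGroupStats` class, the number of row groups, and all min/max values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RowGroupStats:
    index: int
    min_value: str
    max_value: str


def row_groups_to_read(stats: list[RowGroupStats], target: str) -> list[int]:
    """Keep only row groups whose [min, max] interval can contain the target."""
    return [rg.index for rg in stats if rg.min_value <= target <= rg.max_value]


# Hypothetical statistics for a file sorted by source.value.
stats = [
    RowGroupStats(0, "1822001", "1822001"),
    RowGroupStats(1, "1822001", "1822150"),
    RowGroupStats(2, "1822150", "1822301"),
    RowGroupStats(3, "1822301", "1822301"),
]
print(row_groups_to_read(stats, "1822301"))  # → [2, 3]
```

Because the data is sorted, the matching row groups are contiguous, and every other row group can be skipped without downloading any of its column chunks.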

## Query the object store, second try

However, if we flatten the `source` struct to a top-level column containing just the value, then sort by that value, the pruning actually works as expected.

In [None]:
radiator_flat_df = radiator_df.with_column("source_id", daft.col("source")["value"])
radiator_flat_sorted_df = radiator_flat_df.sort([daft.col("source_id"), daft.col("time")])
_ = radiator_flat_sorted_df.write_parquet(f"s3://{BUCKET}/radiator_flat", io_config=io_config, write_mode="overwrite")

In [None]:
df_flat = daft.read_parquet(f"s3://{BUCKET}/radiator_flat", io_config=io_config)
df_flat = df_flat.filter(daft.col("source_id") == "1822301").select(daft.col("meas_PressureBalance")["current_ForceValue"]["value"].alias("force"))

In [None]:
df_flat.collect()

In [None]:
server.shutdown()

## Review questions

**Why is network latency more important for object stores than local files?**
 - How does Parquet's design address this challenge?

**What does a negative byte range request mean?**
 - Why is this useful for reading Parquet files?

**Explain the sequence of HTTP requests when querying a Parquet file on S3.**
 - What gets fetched first? What can be skipped?

**How much data transfer can be saved with good predicate pushdown?**
 - Consider a 1GB file where you need 1% of rows from one partition.

**What are the challenges of using Parquet on object stores vs. local files?**
 - Think about consistency, listing operations, and metadata access.

## Challenges 

### Measure network efficiency

1. Run the same query on sorted vs. unsorted radiator data
2. Compare the total bytes transferred by examining byte ranges in requests
3. Calculate the percentage of the file that was actually read
4. What's the theoretical minimum bytes needed for this query?
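
For step 2, a small helper can add up the bytes requested in the logged `Range` headers. This is only a sketch under assumptions: `bytes_requested` is a hypothetical helper, the header strings and the object size below are invented rather than taken from an actual run, and overlapping ranges are counted twice.

```python
import re

CLOSED = re.compile(r"bytes=(\d+)-(\d+)$")
SUFFIX = re.compile(r"bytes=-(\d+)$")


def bytes_requested(range_header: str, object_size: int) -> int:
    """Number of bytes a single request asks for, capped at the object size."""
    if m := CLOSED.match(range_header):
        start, end = int(m.group(1)), int(m.group(2))
        return max(min(end, object_size - 1) - start + 1, 0)
    if m := SUFFIX.match(range_header):
        return min(int(m.group(1)), object_size)
    # Missing or unrecognized Range header: conservatively assume
    # that the whole object is transferred.
    return object_size


# Invented example values, not taken from an actual run.
object_size = 1_000_000
logged = ["bytes=-65536", "bytes=4-100003", "bytes=500000-599999"]
total = sum(bytes_requested(h, object_size) for h in logged)
print(total, f"{100 * total / object_size:.1f}%")  # → 265536 26.6%
```

To use it with the notebook, you would collect the range headers printed by the custom logger and look up the real object size, for example with `s3.head_object`.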

### Multi-column filtering

1. Create queries that filter on multiple columns:
   - Filter by `source_id` AND `time` range
   - Filter by nested measurement fields
2. Observe which column chunks are fetched
3. Explain the access pattern - does Daft fetch columns separately or together?

### Partition simulation

1. Create a dataset partitioned by `source_id` using the `partition_cols` argument of `write_parquet`.
2. Compare S3 requests when querying:
   - One specific device (one partition)
   - All devices (all partitions)
3. What metadata operations are needed? How many files are read?
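
When reasoning about step 3, it may help to think of partition pruning as a filter on object keys. The sketch below assumes a Hive-style layout with `source_id=<value>/` directories. The key names and the helper `keys_for_partition` are hypothetical, so list the bucket (for example with `s3.list_objects_v2`) to see which keys were actually written.

```python
def keys_for_partition(keys: list[str], column: str, value: str) -> list[str]:
    """Select the object keys that belong to one Hive-style partition."""
    segment = f"{column}={value}"
    return [key for key in keys if segment in key.split("/")]


# Hypothetical listing of a partitioned dataset.
keys = [
    "radiator_partitioned/source_id=1822301/part-0.parquet",
    "radiator_partitioned/source_id=1822302/part-0.parquet",
    "radiator_partitioned/source_id=1822301/part-1.parquet",
]
print(keys_for_partition(keys, "source_id", "1822301"))
```

With such a layout, a query for one device only has to open the files under a single prefix, while a query across all devices has to list and open every file.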

### Real object store comparison 

If you have AWS or Azure access:
1. Upload the sorted radiator data to real S3 or ADLS object stores.
2. Run the same queries as in the notebook
3. Compare latency with the mocked version
4. What's different? Consider:
   - Network latency
   - Service time
   - Throttling or rate limits


## Summary

Parquet is built for querying large files on object stores. However, the querying technology is still fairly new and not all query engines support the entire range of Parquet features yet.