# Object Storage and Parquet

Object storage (such as AWS S3) is the standard storage layer for data lakes. Unlike reads from local files, every read goes over the network and incurs high latency. Parquet is designed for this environment: query engines can download only the chunks of data they need (column pruning and predicate pushdown), which drastically reduces network traffic.
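
To make column pruning concrete, here is a minimal sketch in plain Python. Everything in it is invented for illustration: the `ColumnChunk` class, the `ranges_for_columns` helper, and all offsets and sizes are hypothetical and do not come from the radiator file. It only shows how a reader that knows where each column chunk lives could request just the byte ranges of the selected column.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnChunk:
    """Hypothetical stand-in for one column chunk entry in a Parquet footer."""
    row_group: int
    column: str
    offset: int  # first byte of the chunk within the file
    size: int    # size of the chunk in bytes


def ranges_for_columns(chunks, wanted):
    """Return inclusive (start, end) byte ranges covering the wanted columns."""
    return [
        (c.offset, c.offset + c.size - 1)
        for c in chunks
        if c.column in wanted
    ]


# Invented layout: two row groups with three column chunks each.
chunks = [
    ColumnChunk(0, "time", 4, 100),
    ColumnChunk(0, "source.value", 104, 50),
    ColumnChunk(0, "force", 154, 200),
    ColumnChunk(1, "time", 354, 100),
    ColumnChunk(1, "source.value", 454, 50),
    ColumnChunk(1, "force", 504, 200),
]

ranges = ranges_for_columns(chunks, {"force"})
fetched = sum(end - start + 1 for start, end in ranges)
total = sum(c.size for c in chunks)
print(ranges)  # [(154, 353), (504, 703)]
print(f"{fetched} of {total} bytes requested")  # 400 of 700 bytes requested
```

Each tuple maps directly to an HTTP `Range: bytes=start-end` header, which is the kind of request the logger in this section prints.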

Let's look at how this works in more detail. We simulate an object store using the `moto` AWS mocking library and a custom logger that prints the requests sent over the network.

In [None]:
import os
import threading
import boto3
import daft
from werkzeug.serving import make_server, WSGIRequestHandler
from werkzeug.urls import uri_to_iri
from moto.server import DomainDispatcherApplication, create_backend_app
from IPython.display import Markdown


os.environ["DAFT_DASHBOARD_ENABLED"] = "0"
BUCKET = "data-lake"


def custom_log_request(self, code='-', size='-'):
    """Log only requests that touch our bucket, including their Range header."""
    path = uri_to_iri(self.path)
    if BUCKET in path:
        range_header = self.headers.get('Range', 'No range')
        code = str(code)
        print('📡 %04s [%24s] %s %s %s ' % (self.command, range_header, path, code, size))

# Replace werkzeug's default request logging with our logger.
WSGIRequestHandler.log_request = custom_log_request

def start_s3_server(port=5000):
    """Start the moto S3 mock in a background thread and return the server."""
    app = DomainDispatcherApplication(create_backend_app)

    server = make_server("127.0.0.1", port, app)
    # Daemon thread, so the server does not keep the process alive on exit.
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    print(f"S3 Server running on port {port}")
    return server

server = start_s3_server()
s3 = boto3.client(
    "s3",
    endpoint_url="http://127.0.0.1:5000",
    aws_access_key_id="fake",
    aws_secret_access_key="fake",
    region_name="us-east-1",
)
_ = s3.create_bucket(Bucket=BUCKET)

Just like in the previous section, we'll sort and write some measurements with fairly small row groups to demonstrate pruning. This time, we write to the mocked object store. The file itself is the same, so you can still inspect the file from the previous section and compare it with this output.

In [None]:
radiator_df = daft.read_json('../data/input/radiator.jsonl')
radiator_sorted_df = radiator_df.sort([daft.col("source")["value"], daft.col("time")])

io_config = daft.io.IOConfig(s3=daft.io.S3Config(endpoint_url="http://127.0.0.1:5000", anonymous=True))
daft.set_execution_config(parquet_target_row_group_size=4*1024*1024)
_ = radiator_sorted_df.write_parquet(f"s3://{BUCKET}/radiator", io_config=io_config, write_mode="overwrite")

Next, we'll prepare a query for a particular device that we know comes last in the sort order (and hence sits at the end of the file). Note that a negative byte range is requested while the query is being prepared. A negative byte range indicates an offset from the end of the object. What is at the end of a Parquet file?
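
As a hint, the following sketch shows how a suffix range such as `bytes=-8` resolves to absolute offsets, and how the last 8 bytes of a Parquet file can be decoded: a 4-byte little-endian length of the file metadata followed by the magic bytes `PAR1`. The helper names `resolve_suffix_range` and `footer_length` and the fake `blob` are made up for this example; the blob is not a valid Parquet file, it only mimics the trailer layout.

```python
import struct

MAGIC = b"PAR1"


def resolve_suffix_range(header, object_size):
    """Turn a suffix range such as 'bytes=-8' into inclusive (start, end) offsets."""
    unit, _, spec = header.partition("=")
    if unit != "bytes" or not spec.startswith("-"):
        raise ValueError(f"not a suffix byte range: {header!r}")
    # A suffix longer than the object simply selects the whole object.
    length = min(int(spec[1:]), object_size)
    return object_size - length, object_size - 1


def footer_length(tail):
    """Read the metadata length from the last 8 bytes of a Parquet-style file."""
    if tail[-4:] != MAGIC:
        raise ValueError("missing PAR1 magic")
    (length,) = struct.unpack("<I", tail[-8:-4])
    return length


# Fake object: magic, some "data", 5 bytes of pretend metadata, then the trailer.
metadata = b"\x00" * 5
blob = MAGIC + b"x" * 20 + metadata + struct.pack("<I", len(metadata)) + MAGIC

start, end = resolve_suffix_range("bytes=-8", len(blob))
tail = blob[start:end + 1]
print(start, end, footer_length(tail))  # 29 36 5
```

A reader that does not know the metadata size in advance can start with a suffix request like this and then fetch the metadata itself with a second, larger request.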

In [None]:
df_s3 = daft.read_parquet(f"s3://{BUCKET}/radiator", io_config=io_config)
df_s3 = df_s3.filter(daft.col("source")["value"] == "1822301").select(daft.col("meas_PressureBalance")["current_ForceValue"]["value"].alias("force"))

Now, let's actually run the query. Note that all requests read only a part of the file. What part is this? Inspect the previous section's file, e.g., using

```
parquet-tools inspect --row-group 13 --column-chunk 22 ….parquet | python3 -m json.tool | less
```

(Column chunk 2 is `source.value`; column chunk 22 is `meas_PressureBalance.current_ForceValue.value`.)

In [None]:
df_s3.collect()


The ranges shown here actually reach from the start of column `source.value` to the end of column `meas_PressureBalance.current_ForceValue.value` in each row group. Daft seems to be too daft (sorry for the pun) to understand from the min and max values of `source.value` that only row group 13 needs to be queried.
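
For comparison, this is roughly the check that the row group statistics would allow. It is a sketch under invented assumptions, not Daft's implementation: `RowGroupStats`, `row_groups_for_equality`, and the min/max values are hypothetical and only mimic a file sorted by `source.value`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RowGroupStats:
    """Hypothetical per-row-group min/max statistics for one column."""
    index: int
    min_value: str
    max_value: str


def row_groups_for_equality(stats, value):
    """Keep only row groups whose [min, max] interval can contain value."""
    return [s.index for s in stats if s.min_value <= value <= s.max_value]


# Invented statistics; in a sorted file the intervals barely overlap.
stats = [
    RowGroupStats(0, "1000001", "1200450"),
    RowGroupStats(1, "1200450", "1650000"),
    RowGroupStats(2, "1650000", "1822301"),
]

print(row_groups_for_equality(stats, "1822301"))  # [2]
print(row_groups_for_equality(stats, "1200450"))  # [0, 1]
```

The second call shows why the check is conservative: a value that sits on a boundary may appear in two neighbouring row groups, so both have to be read.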

In [None]:
from io import StringIO
import re

def simplify_explain(explain_output):
    """Simplify explain output by removing verbose schema definitions"""
    lines = explain_output.split('\n')
    result = []
    i = 0

    while i < len(lines):
        line = lines[i]

        # Check if this is an IO config line
        if '   IO config = ' in line:
            result.append(re.sub(r'(IO config = ).*', r'\1<...>', line))
            i += 1
            # Skip all heavily indented continuation lines
            while i < len(lines) and (lines[i].startswith('|     ') or lines[i].startswith('    ')):
                i += 1
            continue

        # Check if this is a File/Output schema line
        if re.search(r'\|   (File schema|Output schema) = ', line):
            result.append(re.sub(r'((File schema|Output schema) = ).*', r'\1<...>', line))
            i += 1
            # Skip continuation lines (more indented)
            while i < len(lines) and (lines[i].startswith('|     ') or lines[i].startswith('    ')):
                i += 1
            continue

        # Check if this is Schema: {...} (in Physical Plan)
        if '   Schema: ' in line:
            result.append(re.sub(r'(Schema: ).*', r'\1<...>', line))
            i += 1
            # Skip continuation lines
            while i < len(lines) and (lines[i].startswith('|     ') or lines[i].startswith('    ')):
                i += 1
            continue

        # Skip these specific lines entirely
        if 'Coerce int96' in line or 'Use multithreading' in line:
            i += 1
            continue

        result.append(line)
        i += 1

    return '\n'.join(result)

buffer = StringIO()
df_s3.explain(show_all=True, file=buffer)
plan = buffer.getvalue()
plan = simplify_explain(plan)
print(plan)

In [None]:
server.shutdown()