# Object Storage and Parquet

Object storage (such as AWS S3) is the standard storage layer for data lakes. Unlike reads from a local disk, every read from an object store is a network request with high latency. Parquet is designed for this environment: a query engine can download only the byte ranges it needs, fetching just the selected columns (column pruning) and skipping row groups that cannot match a filter (predicate pushdown), which drastically reduces network traffic.
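The row-group skipping behind predicate pushdown can be sketched in a few lines. This is only an illustration, not how Daft or any real Parquet reader is implemented: `RowGroupStats`, `ranges_to_fetch`, and all offsets and values are hypothetical. The one thing taken from the format is that the footer records min/max statistics for each column chunk.

```python
from dataclasses import dataclass


@dataclass
class RowGroupStats:
    """Hypothetical stand-in for per-row-group statistics from a Parquet footer."""
    byte_offset: int  # where the column chunk starts in the file
    byte_length: int  # how many bytes the column chunk occupies
    min_value: str    # smallest value of the filter column in this row group
    max_value: str    # largest value of the filter column in this row group


def ranges_to_fetch(row_groups, wanted):
    """Return inclusive (start, end) byte ranges of row groups that may contain `wanted`."""
    ranges = []
    for rg in row_groups:
        # A row group can be skipped when `wanted` lies outside [min, max].
        if rg.min_value <= wanted <= rg.max_value:
            ranges.append((rg.byte_offset, rg.byte_offset + rg.byte_length - 1))
    return ranges


# Made-up statistics for a file whose rows are sorted by the filter column.
footer = [
    RowGroupStats(4, 1000, "1822299", "1822300"),
    RowGroupStats(1004, 1000, "1822301", "1822301"),
    RowGroupStats(2004, 1000, "1822302", "1822305"),
]
print(ranges_to_fetch(footer, "1822301"))  # [(1004, 2003)]
```

Only one of the three row groups has to be downloaded here. Sorting the data by the filter column before writing keeps the min/max ranges narrow, so more row groups can be skipped.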

Let's look at how this works in more detail. We simulate an object store with the `moto` AWS mocking library and install a custom request logger that prints every request made against our bucket.

In [None]:
import os
import threading
import boto3
import daft
from werkzeug.serving import make_server, WSGIRequestHandler
from werkzeug.urls import uri_to_iri
from moto.server import DomainDispatcherApplication, create_backend_app

os.environ["DAFT_DASHBOARD_ENABLED"] = "0"
BUCKET = "data-lake"


def custom_log_request(self, code='-', size='-'):
    path = uri_to_iri(self.path)
    if BUCKET in path:
        range_header = self.headers.get('Range', 'No range')
        code = str(code)
        print('📡 %04s [%24s] %s %s %s ' % (self.command, range_header, path, code, size))

WSGIRequestHandler.log_request = custom_log_request

def start_s3_server(port=5000):
    # Serve moto's mock AWS backends from a background thread in this process.
    app = DomainDispatcherApplication(create_backend_app)

    server = make_server("127.0.0.1", port, app)
    thread = threading.Thread(target=server.serve_forever)
    thread.daemon = True
    thread.start()
    print(f"S3 Server running on port {port}")
    return server

server = start_s3_server()
s3 = boto3.client(
    "s3",
    endpoint_url="http://127.0.0.1:5000",
    aws_access_key_id="fake",
    aws_secret_access_key="fake",
    region_name="us-east-1",
)
_ = s3.create_bucket(Bucket=BUCKET)

In [None]:
radiator_df = daft.read_json('../data/input/radiator.jsonl')
radiator_sorted_df = radiator_df.sort([daft.col("source")["value"], daft.col("time")])

io_config = daft.io.IOConfig(s3=daft.io.S3Config(endpoint_url="http://127.0.0.1:5000", anonymous=True))
_ = radiator_sorted_df.write_parquet(f"s3://{BUCKET}/radiator", io_config=io_config, write_mode="overwrite")

### Column Pruning vs Full Scan

Watch the request log in the output above while the next cell runs.

1.  **Metadata Read:** You will see a request for the file footer (usually the last few KB of the file).
2.  **Column Read:** You will see specific `Range: bytes=...` requests for the column chunks we selected.
3.  **No Full Download:** You will *not* see a request that downloads the whole file.
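The metadata read in step 1 relies on the Parquet trailer layout: a file ends with a 4-byte little-endian footer length followed by the magic bytes `PAR1`. The sketch below replays that lookup against an in-memory blob. `make_fake_file`, `fetch_range`, and `read_footer` are hypothetical helpers, and the "footer" is a placeholder string rather than real Thrift-encoded metadata.

```python
import struct

MAGIC = b"PAR1"  # Parquet files start and end with these four bytes


def make_fake_file(body: bytes, footer: bytes) -> bytes:
    """Build bytes with the Parquet trailer layout (the footer is NOT real metadata)."""
    return MAGIC + body + footer + struct.pack("<I", len(footer)) + MAGIC


def fetch_range(blob: bytes, start: int, end: int) -> bytes:
    """Stand-in for an HTTP `Range: bytes=start-end` request (end is inclusive)."""
    print(f"GET bytes={start}-{end}")
    return blob[start:end + 1]


def read_footer(blob: bytes) -> bytes:
    size = len(blob)  # assumption: the object size is already known to the client
    # Request 1: the last 8 bytes hold the footer length and the magic bytes.
    tail = fetch_range(blob, size - 8, size - 1)
    if tail[4:] != MAGIC:
        raise ValueError("not a Parquet-style file")
    (footer_len,) = struct.unpack("<I", tail[:4])
    # Request 2: the footer itself, which sits directly before the trailer.
    return fetch_range(blob, size - 8 - footer_len, size - 9)


blob = make_fake_file(body=b"x" * 100, footer=b"schema+stats")
print(read_footer(blob))
```

A real reader may fetch a larger tail in a single request so that the footer usually arrives without a second round trip, which would match the "last few KB" request described in step 1.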

In [None]:
df_s3 = daft.read_parquet(f"s3://{BUCKET}/radiator", io_config=io_config)
df_s3 = (
    df_s3
    .where(daft.col("source")["value"] == "1822301")
    .select(daft.col("meas_Load_1")["current_ForceValue"]["value"].alias("force"))
    .collect()
)
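To put a number on the savings, the `Range` values printed by the request logger can be added up and compared with the object size. The helper below is a small sketch of that arithmetic for single-range headers. `range_size`, the example headers, and the file size are all made up for illustration and are not taken from a real run.

```python
import re


def range_size(header: str, file_size: int) -> int:
    """Number of bytes covered by a single-range HTTP Range header value."""
    match = re.fullmatch(r"bytes=(\d*)-(\d*)", header.strip())
    if not match or match.groups() == ("", ""):
        raise ValueError(f"unsupported Range header: {header!r}")
    start, end = match.groups()
    if start == "":
        # Suffix form "bytes=-N": the last N bytes of the object.
        return min(int(end), file_size)
    first = int(start)
    # Open-ended form "bytes=N-" runs to the end; both bounds are inclusive.
    last = min(int(end), file_size - 1) if end else file_size - 1
    return max(0, last - first + 1)


# Hypothetical log entries for an object of 1,000,000 bytes.
file_size = 1_000_000
logged = ["bytes=-8", "bytes=990000-999991", "bytes=4-20003"]
fetched = sum(range_size(h, file_size) for h in logged)
print(f"{fetched} of {file_size} bytes fetched ({fetched / file_size:.1%})")
```

With these made-up headers, about 3% of the object crosses the network; a full scan would transfer all of it.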

In [None]:
# Cleanup (optional: restarting the kernel also stops the server)
server.shutdown()