# Distributed ML model batch inference on data in DeltaLake

In this tutorial, we showcase how to perform ML model batch inference on data in a DeltaLake table.

This tutorial continues from the previous one on **local** batch inference. Running locally first is a good way to confirm that your code works before scaling up to a distributed batch inference workload, so give that tutorial a read before starting this one!

To run this tutorial, you need AWS credentials provisioned on your machine, because all of the data is hosted in a requester-pays bucket in AWS S3.

Let's get started!

In [None]:
CI = False

In [None]:
# Skip this notebook execution in CI because it hits non-public buckets
if CI:
    import sys
    sys.exit()

# Going Distributed

The first step (and most important for this demo!) is to switch our Daft runner to the Ray runner, and point it at a Ray cluster. This is super simple:

In [1]:
import daft

# If you have your own Ray cluster running, feel free to set this to that address!
# RAY_ADDRESS = "ray://localhost:10001"
RAY_ADDRESS = None

daft.context.set_runner_ray(address=RAY_ADDRESS)

DaftContext(_daft_execution_config=<daft.daft.PyDaftExecutionConfig object at 0x1039afc90>, _daft_planning_config=<daft.daft.PyDaftPlanningConfig object at 0x1039afc10>, _runner_config=_RayRunnerConfig(address=None, max_task_backlog=None), _disallow_set_runner=True, _runner=None)

Now, we run the same operations as before. The only difference is that instead of execution happening locally on the machine that's running this code, Daft will distribute the computation over your Ray cluster!

In [2]:
# Feel free to tweak this variable to have the tutorial run on as many rows as you'd like!
NUM_ROWS = 1000

### Retrieving data

We will be retrieving the data exactly the same way we did in the previous tutorial, with the same API and arguments.

In [3]:
# Provision Cloud Credentials
import boto3
import daft

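session = boto3.session.Session()
creds = session.get_credentials()
if creds is None:
    raise RuntimeError("No AWS credentials found; configure them before running this tutorial")

# Note the naming: Daft's `key_id` is the AWS access key ID,
# while `access_key` is the AWS secret access key
io_config = daft.io.IOConfig(
    s3=daft.io.S3Config(
        access_key=creds.secret_key,
        key_id=creds.access_key,
        session_token=creds.token,
        region_name="us-west-2",
    )
)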

# Retrieve data
df = daft.read_deltalake("s3://daft-public-datasets/imagenet/val-10k-sample-deltalake/", io_config=io_config)

# Prune data
df = df.limit(NUM_ROWS)
df = df.where(df["object"].list.lengths() == 1)

### Splitting the data into more partitions

We now split the data into more partitions, which gives us additional parallelism when processing the data in a **distributed** fashion.

In [4]:
df = df.into_partitions(16)
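
To see why more partitions help, here is a conceptual sketch in plain Python. It is not how Daft or Ray implement partitioning: `split_into_partitions` and `process_partition` are hypothetical helpers, and a thread pool stands in for the workers of a cluster. The point is only that each partition is an independent unit of work, so 16 partitions can keep up to 16 workers busy at once.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(rows, num_partitions):
    """Split rows into num_partitions contiguous chunks of near-equal size."""
    base, extra = divmod(len(rows), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        # The first `extra` partitions take one additional row each
        size = base + (1 if i < extra else 0)
        partitions.append(rows[start:start + size])
        start += size
    return partitions


def process_partition(partition):
    """Stand-in for per-partition work such as downloading and decoding images."""
    return [row * 2 for row in partition]


rows = list(range(1000))
partitions = split_into_partitions(rows, 16)

# Each partition can be handed to a different worker
with ThreadPoolExecutor(max_workers=16) as pool:
    results = list(pool.map(process_partition, partitions))

# Flatten the per-partition results back into a single list
processed = [row for part in results for row in part]
```

With a single partition, the same work would run as one task on one worker, no matter how large the cluster is.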

### Retrieving the images and preprocessing

Now we continue with exactly the same code as in the local case to retrieve and preprocess our images.

In [5]:
# Retrieve images and run preprocessing
df = df.with_column(
    "image_url",
    "s3://daft-public-datasets/imagenet/val-10k-sample-deltalake/images/" + df["filename"] + ".jpeg"
)
df = df.with_column("image", df["image_url"].url.download().image.decode())
df = df.with_column("image_resized_small", df["image"].image.resize(32, 32))
df = df.with_column("image_resized_large", df["image"].image.resize(256, 256))

### Running batch inference with a UDF

Running the UDF is also exactly the same!

In [6]:
# Run batch inference over the entire dataset
import daft
import numpy as np
import torch
from torchvision.models import resnet50, ResNet50_Weights

@daft.udf(return_dtype=daft.DataType.string())
class ClassifyImage:
    def __init__(self):
        weights = ResNet50_Weights.DEFAULT
        self.model = resnet50(weights=weights)
        self.model.eval()
        self.preprocess = weights.transforms()
        self.category_map = weights.meta["categories"]

    def __call__(self, images: daft.Series, shape: list[int]):
        if len(images) == 0:
            return []

        # Convert the Daft Series into a list of NumPy arrays
        data = images.cast(daft.DataType.tensor(daft.DataType.uint8(), tuple(shape))).to_pylist()

        # Stack into a single tensor and reorder from (N, H, W, C) to (N, C, H, W)
        images_array = torch.tensor(np.array(data)).permute((0, 3, 1, 2))

        # Run the model without tracking gradients, since this is inference only
        batch = self.preprocess(images_array)
        with torch.no_grad():
            # Softmax over the class dimension (dim 1) gives per-image class probabilities
            prediction = self.model(batch).softmax(1)
        class_ids = prediction.argmax(1).tolist()

        # Map each predicted class index back to a human-readable label
        return [self.category_map[class_id] for class_id in class_ids]

# Keep only rows whose decoded image has exactly 3 channels
df = df.where(df["image"].apply(lambda img: img.shape[2] == 3, return_dtype=daft.DataType.bool()))

df = df.with_column("predictions_lowres", ClassifyImage(df["image_resized_small"], [32, 32, 3]))
df = df.with_column("predictions_highres", ClassifyImage(df["image_resized_large"], [256, 256, 3]))

# Prune the results and write data back out as Parquet
df = df.select(
    "filename",
    "image_url",
    "object",
    "predictions_lowres",
    "predictions_highres",
)
df.write_parquet("my_results.parquet")

2024-03-29 19:38:18,040	INFO worker.py:1642 -- Started a local Ray instance.


ScanWithTask-LocalLimit-LocalLimit-Project-Filter [Stage:3]:   0%|          | 0/1 [00:00<?, ?it/s]

FanoutSlices [Stage:2]:   0%|          | 0/1 [00:00<?, ?it/s]

Project-Project-Filter-Project-WriteFile [Stage:1]:   0%|          | 0/1 [00:00<?, ?it/s]

path Utf8
my_results.parquet/8eb54f00-9537-4e28-ac85-e96a00a071d5-0.parquet
my_results.parquet/04ccf8fe-9777-4307-9e1f-916c8532ca1c-0.parquet
my_results.parquet/867fc77f-f730-4b53-8e9a-11ed5dc9b98f-0.parquet
my_results.parquet/e4645f7b-8a70-4ee8-8221-823777467a0a-0.parquet
my_results.parquet/dd41fced-6e6b-4ece-8e58-d0804311b4ff-0.parquet
my_results.parquet/c548e6f4-3c83-4f76-b7c5-821f81157720-0.parquet
my_results.parquet/28753019-9875-45a2-94b4-b7b9217492ca-0.parquet
my_results.parquet/f66ffaa6-cc2e-4328-8137-aa358244a8a3-0.parquet
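
The label-mapping step inside the UDF can be illustrated without PyTorch. The sketch below is plain Python under made-up assumptions: `logits` is a hypothetical score matrix for 2 images and 3 classes, and `categories` is a hypothetical label list. It applies softmax across the classes of each image, then picks the most probable class per image.

```python
import math


def softmax(scores):
    """Numerically stable softmax over one image's class scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(logits, categories):
    """Return the most probable label for each row of class scores."""
    labels = []
    for row in logits:
        probs = softmax(row)  # normalise across classes, not across images
        best = max(range(len(probs)), key=probs.__getitem__)
        labels.append(categories[best])
    return labels


# Hypothetical values, for illustration only
categories = ["cat", "dog", "fish"]
logits = [
    [2.0, 0.5, 0.1],
    [0.2, 0.3, 3.0],
]
labels = classify(logits, categories)
```

Normalising across the batch instead would make one image's probabilities depend on whichever other images happen to share its batch.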


# Now, take a look at your handiwork!

Let's read the results of our distributed Daft job!

In [None]:
daft.read_parquet("my_results.parquet").collect()
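
One quick check on the results is how often the low-resolution and high-resolution predictions agree. The sketch below is plain Python over a hypothetical column dictionary, shaped like what `to_pydict()` returns for a DataFrame. The values are made up, and `agreement_rate` is an illustrative helper rather than part of Daft.

```python
def agreement_rate(columns):
    """Fraction of rows where both prediction columns hold the same label."""
    lowres = columns["predictions_lowres"]
    highres = columns["predictions_highres"]
    if not lowres:
        return 0.0
    matches = sum(1 for lo, hi in zip(lowres, highres) if lo == hi)
    return matches / len(lowres)


# Made-up rows; real ones could come from daft.read_parquet(...).to_pydict()
columns = {
    "filename": ["a", "b", "c", "d"],
    "predictions_lowres": ["tabby", "goldfish", "beagle", "tabby"],
    "predictions_highres": ["tabby", "goldfish", "basset", "tiger cat"],
}
rate = agreement_rate(columns)
```

A low agreement rate would suggest that the 32x32 images lose too much detail for the classifier to be reliable.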