# Top N Most Red Images

This demo walks through a simple example of using Daft to find the top N most red images in the OpenImages dataset.

In [None]:
distributed = False

In [None]:
###
# Settings for running on 10,000 rows in a distributed Ray cluster
###

if distributed:
    import daft.context
    import ray

    ray.init(
        address="ray://localhost:10001",
        runtime_env={"pip": ["getdaft", "pillow", "s3fs"]},
    )

    daft.context.set_runner_ray(address="ray://localhost:10001")


## Constructing our DataFrame

In [None]:
from daft import DataFrame, lit

df = DataFrame.from_files("s3://daft-public-data/open-images/validation-images/*")

if distributed:
    df = df.limit(10000)
    df = df.repartition(64)
else:
    df = df.limit(100)

In [None]:
df

### Filtering Data

In [None]:
df = df.where(df["size"] < 300000)

**Daft is LAZY**: this filtering doesn't run immediately, but rather queues operations up in a query plan.
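To make the idea of lazy evaluation concrete, here is a minimal sketch in plain Python. The `LazyRows` class and its methods are hypothetical stand-ins invented for illustration and do not reflect how Daft is implemented. The sketch only shows that a queued filter does no work until the plan is executed.

```python
from typing import Callable, Iterable, List, Optional

Predicate = Callable[[dict], bool]


class LazyRows:
    """Toy stand-in for a lazy DataFrame (hypothetical, not Daft's implementation)."""

    def __init__(self, rows: Iterable[dict], plan: Optional[List[Predicate]] = None):
        self._rows = list(rows)
        self._plan = list(plan or [])

    def where(self, predicate: Predicate) -> "LazyRows":
        # Only record the predicate; no row is inspected yet.
        return LazyRows(self._rows, self._plan + [predicate])

    def explain(self) -> str:
        # Describe the queued operations without running them.
        return " -> ".join(["Scan"] + ["Filter"] * len(self._plan))

    def collect(self) -> List[dict]:
        # Execution happens here: apply every queued filter to each row.
        return [r for r in self._rows if all(p(r) for p in self._plan)]


calls = []  # records each time the predicate actually runs


def smaller_than_300k(row: dict) -> bool:
    calls.append(row["name"])
    return row["size"] < 300_000


rows = [{"name": "a.jpg", "size": 250_000}, {"name": "b.jpg", "size": 400_000}]
lazy = LazyRows(rows).where(smaller_than_300k)
calls_before_collect = len(calls)  # 0: the filter was queued, not run
result = lazy.collect()
calls_after_collect = len(calls)   # 2: one call per row during execution
```

Building the plan costs nothing; only `collect()` touches the rows.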

Now, let's define another filter with a lower bound for the image size:

In [None]:
df = df.where(df["size"] > 200000)

If we look at the plan, there are now two enqueued `Filter` operations!

In [None]:
df.explain()

Running these two `Filter`s one after another is inefficient, since it requires two passes over the data!

Don't worry though: Daft's query optimizer handles this at runtime by merging the two `Filter` operations into one. You can view the optimized plan with `show_optimized=True`:

In [None]:
df.explain(show_optimized=True)

This is just one example of query optimization. Daft performs other important optimizations, such as predicate pushdown and column pruning, to keep your execution plans efficient.
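The filter-merging rewrite described above can be sketched as a small plan transformation. This is an illustrative toy, not Daft's optimizer: the plan representation (a list of `(kind, argument)` tuples) and the `merge_adjacent_filters` function are assumptions made up for this example.

```python
from typing import Callable, List, Optional, Tuple

Predicate = Callable[[dict], bool]
Op = Tuple[str, Optional[Predicate]]  # hypothetical plan node: (kind, predicate)


def merge_adjacent_filters(plan: List[Op]) -> List[Op]:
    """Fuse consecutive filter nodes into one node that checks both predicates."""
    merged: List[Op] = []
    for kind, arg in plan:
        if kind == "filter" and merged and merged[-1][0] == "filter":
            prev = merged[-1][1]
            # Bind both predicates as defaults so each fused lambda keeps its own pair.
            merged[-1] = ("filter", lambda row, a=prev, b=arg: a(row) and b(row))
        else:
            merged.append((kind, arg))
    return merged


plan: List[Op] = [
    ("scan", None),
    ("filter", lambda row: row["size"] < 300_000),
    ("filter", lambda row: row["size"] > 200_000),
]

optimized = merge_adjacent_filters(plan)
fused = optimized[1][1]  # single predicate equivalent to both original filters
```

After the rewrite, each row is tested once against the combined condition instead of flowing through two separate filter steps.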

Now we can **materialize** the filtered dataframe like so:

In [None]:
# Materializes the dataframe and shows first 10 rows

df.collect()

Note that calling `.collect()` **materializes** the data. It will execute the above plan and all the computed data will be materialized in memory as long as the `df` variable is valid. This means that any subsequent operation on `df` will read from this materialized data instead of computing the entire plan again.
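The caching behavior described above can be pictured with a small plain-Python sketch. The `MaterializedPlan` class is hypothetical and exists only to illustrate the idea; it says nothing about how Daft stores materialized results internally.

```python
from typing import Callable, List, Optional


class MaterializedPlan:
    """Toy model of a plan whose result is cached after the first collect()."""

    def __init__(self, compute: Callable[[], List[int]]):
        self._compute = compute
        self._cache: Optional[List[int]] = None
        self.executions = 0  # how many times the plan actually ran

    def collect(self) -> List[int]:
        if self._cache is None:
            # First call: run the plan and keep the result in memory.
            self.executions += 1
            self._cache = self._compute()
        # Later calls: reuse the cached result.
        return self._cache


sizes = [150_000, 250_000, 280_000, 400_000]
plan = MaterializedPlan(lambda: [s for s in sizes if 200_000 < s < 300_000])

first = plan.collect()
second = plan.collect()
print(first)            # [250000, 280000]
print(plan.executions)  # 1
```

The second `collect()` returns the same in-memory result without re-running the computation.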

Let's prune the columns to just the "name" column, which is the only one we need at the moment:

In [None]:
df = df.select("name")

In [None]:
# Show doesn't materialize the data, but lets us peek at the first N rows
# produced by the current query plan

df.show(5)

### Working with Complex Data

Now let's do some data manipulation, starting with simple data (URLs) and moving on to images!

In [None]:
df = df.with_column("image", lit("s3://").str.concat(df["name"]).url.download())

In [None]:
# Materialize the dataframe, so that we don't have to hit S3 again for subsequent operations
df.collect()

To load the raw bytes we're downloading from the URL into a `PIL` image, we can define a simple function and run it on our column:

In [None]:
from daft import udf

import io
import PIL.Image  # a bare `import PIL` does not load the Image submodule

@udf(num_cpus=2, return_type=PIL.Image.Image)
def bytes_to_pil(bytes_column):
    return [
        PIL.Image.open(io.BytesIO(data)).resize((256, 256)) for data in bytes_column
    ]

df = df.with_column("image", bytes_to_pil(df["image"]))

In [None]:
df.show(5)

In [None]:
from PIL import ImageFilter
import numpy as np


def magic_red_detector(img: PIL.Image.Image) -> PIL.Image.Image:
    """Gets a new image which is a mask covering all 'red' areas in the image"""
    # HSV bounds. Red hue wraps around, so the lower hue bound exceeds the upper one.
    lower = np.array([245, 100, 100])
    upper = np.array([10, 255, 255])
    lower_hue = lower[0, np.newaxis, np.newaxis]
    upper_hue = upper[0, np.newaxis, np.newaxis]
    lower_saturation_intensity = lower[1:, np.newaxis, np.newaxis]
    upper_saturation_intensity = upper[1:, np.newaxis, np.newaxis]
    hsv = img.convert("HSV")
    hsv = np.asarray(hsv).T  # transpose so the channel axis comes first
    # Saturation and intensity must both fall within their bounds
    saturation_intensity_ok = np.all(
        (hsv[1:, ...] >= lower_saturation_intensity)
        & (hsv[1:, ...] <= upper_saturation_intensity),
        axis=0,
    )
    # Hue passes if it lies on either side of the wraparound point
    hue_ok = (hsv[0, ...] >= lower_hue) | (hsv[0, ...] <= upper_hue)
    mask = saturation_intensity_ok & hue_ok
    img = PIL.Image.fromarray(mask.T)
    img = img.filter(ImageFilter.ModeFilter(size=5))
    return img


df = df.with_column(
    "red_mask",
    df["image"].apply(magic_red_detector),
)
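The hue test in `magic_red_detector` combines its two bounds with `|` rather than `&` because red sits at the wraparound point of the hue circle. The sketch below illustrates this on a handful of made-up hue values, assuming hue is stored as an 8-bit value in the range 0 to 255 (as in Pillow's `HSV` mode). The sample values are invented for illustration.

```python
import numpy as np

# Made-up hue samples on a 0..255 scale; values near 0 and near 255 are both "red".
hue = np.array([0, 5, 10, 128, 245, 250], dtype=np.uint8)
lower_hue, upper_hue = 245, 10  # same hue bounds as in the detector above

# Wraparound range: accept hues at or above the lower bound OR at or below the upper bound.
wrapped = (hue >= lower_hue) | (hue <= upper_hue)

# A conventional "between" test fails here because lower_hue > upper_hue.
naive = (hue >= lower_hue) & (hue <= upper_hue)

print(wrapped.tolist())  # [True, True, True, False, True, True]
print(naive.tolist())    # [False, False, False, False, False, False]
```

Only the mid-range hue (128) is rejected by the wraparound test, while the conventional range test matches nothing.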

In [None]:
df.collect()

In [None]:
import numpy as np

def sum_mask(mask: PIL.Image.Image) -> int:
    val = np.asarray(mask).sum()
    return int(val)

df = df.with_column(
    "num_pixels_red",
    df["red_mask"].apply(sum_mask),
)

In [None]:
df.sort("num_pixels_red", desc=True).collect()