Collection.validate(eager=True) slower than Collection.validate(eager=False).collect_all() #320

@gab23r

Currently, these two calls are not equivalent in terms of performance:

MyCollection.validate(data, eager=True)                    # Scans source N times
MyCollection.validate(data, eager=False).collect_all()     # Scans source once

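When members share a common source (e.g., the same parquet file), eager=True collects each member independently, so the shared source is scanned once per collect. With eager=False + collect_all(), all members are collected together, and Polars can apply common subplan elimination so that the shared part of the plan runs only once.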

Is this behavior intentional?

MRE
import polars as pl
import dataframely as dy

scan_count = 0


def count_scans(s: pl.Series) -> pl.Series:
    global scan_count
    scan_count += 1
    return s


class A(dy.Schema):
    x = dy.Integer(primary_key=True)


class B(dy.Schema):
    x = dy.Integer(primary_key=True)
    y = dy.Integer()


class MyCollection(dy.Collection):
    a: dy.LazyFrame[A]
    b: dy.LazyFrame[B]


# Both members derive from same source with an "expensive" operation
source = pl.LazyFrame({"x": [1, 2, 3]}).with_columns(
    pl.col("x").map_batches(count_scans)
)
data = {
    "a": source,
    "b": source.with_columns(y=pl.col("x") * 2),
}

# Test 1: eager=True
scan_count = 0
MyCollection.validate(data, eager=True)
print(f"eager=True:                        {scan_count} scans")

# Test 2: eager=False + collect_all()
scan_count = 0
MyCollection.validate(data, eager=False).collect_all()
print(f"eager=False + collect_all():       {scan_count} scans")
# eager=True:                        3 scans
# eager=False + collect_all():       1 scans
