8 changes: 0 additions & 8 deletions dataframely/collection/collection.py
@@ -956,11 +956,6 @@ def scan_parquet(
ValueError: If the provided directory does not contain parquet files for
all required members.

Note:
Due to current limitations in dataframely, this method actually reads the
parquet file into memory if `"validation"` is `"warn"` or `"allow"`
and validation is required.

Attention:
Be aware that this method suffers from the same limitations as
:meth:`serialize`.
@@ -1049,9 +1044,6 @@ def scan_delta(
ValueError:
If the provided source does not contain Delta tables for all required members.

Note:
Due to current limitations in dataframely, this method may read the Delta table into memory if `validation` is `"warn"` or `"allow"` and validation is required.

Attention:
Schema metadata is stored as custom commit metadata. Only the schema
information from the last commit is used, so any table modifications
13 changes: 13 additions & 0 deletions docs/_templates/classes/error.rst
@@ -0,0 +1,13 @@
:html_theme.sidebar_secondary.remove: true

.. role:: hidden

{{ name | underline }}

.. currentmodule:: {{ module }}

.. autoclass:: {{ name }}
:members:
:exclude-members: add_note, with_traceback
:autosummary:
:autosummary-nosignatures:
15 changes: 15 additions & 0 deletions docs/api/errors/index.rst
@@ -0,0 +1,15 @@
======
Errors
======

.. currentmodule:: dataframely
.. autosummary::
:toctree: _gen/
:template: classes/error.rst
:nosignatures:

~exc.SchemaError
~exc.ValidationError
~exc.ImplementationError
~exc.AnnotationImplementationError
~exc.ValidationRequiredError
9 changes: 9 additions & 0 deletions docs/api/index.rst
@@ -25,6 +25,15 @@ API Reference

columns/index

.. grid::

.. grid-item-card::

.. toctree::
:maxdepth: 1

errors/index

.. grid-item-card::

.. toctree::
4 changes: 3 additions & 1 deletion docs/conf.py
@@ -174,7 +174,9 @@ def hide_class_signature(
    return_annotation: str,
) -> tuple[str, str] | None:
    if what == "class" and (
        name.endswith("FilterResult") or name.endswith("FailureInfo")
        name.endswith("FilterResult")
        or name.endswith("FailureInfo")
        or name.endswith("AnnotationImplementationError")
    ):
        # Return empty signature (no args after the class name)
        return "", return_annotation
17 changes: 17 additions & 0 deletions docs/guides/faq.md
@@ -29,3 +29,20 @@ class UserSchema(dy.Schema):
        """Email must be unique, if provided."""
        return pl.col("email").is_null() | pl.col("email").is_unique()
```

## How do I fix the ruff error `First argument of a method should be named self`?

If you are using [`ruff`](https://docs.astral.sh/ruff/) and introduce custom rules for your schemas, `ruff` will report
the following linting error:

```
N805 First argument of a method should be named `self`
```
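
For illustration, a rule defined like the following is what triggers the warning (a minimal sketch; the schema and rule
names are hypothetical):

```python
import dataframely as dy
import polars as pl


class UserSchema(dy.Schema):
    email = dy.String(nullable=True)

    @dy.rule()
    def email_unique_if_provided(cls) -> pl.Expr:  # ruff flags `cls` with N805 here
        return pl.col("email").is_null() | pl.col("email").is_unique()
```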

To fix this, you'll need to let `ruff` know that the `@dy.rule` decorator turns the decorated method into a
classmethod. This can be done by adding the following to your `pyproject.toml`:

```toml
[tool.ruff.lint.pep8-naming]
classmethod-decorators = ["dataframely.rule"]
```
1 change: 1 addition & 0 deletions docs/guides/features/index.md
@@ -8,4 +8,5 @@ data-generation
primary-keys
serialization
sql-generation
lazy-validation
```
41 changes: 41 additions & 0 deletions docs/guides/features/lazy-validation.md
@@ -0,0 +1,41 @@
# Lazy Validation

In many cases, dataframely's capability to validate and filter input data is used at core application boundaries.
As a result, `validate` and `filter` are generally expected to be used at points where `collect` is called on a lazy
frame. However, there may be situations where validation or filtering should simply be added to the lazy computation
graph. Starting in dataframely v2, this is supported via a custom polars plugin.

## The `eager` parameter

All of the following methods expose an `eager: bool` parameter:

- {meth}`Schema.validate() <dataframely.Schema.validate>`
- {meth}`Schema.filter() <dataframely.Schema.filter>`
- {meth}`Collection.validate() <dataframely.Collection.validate>`
- {meth}`Collection.filter() <dataframely.Collection.filter>`

By default, `eager=True`. However, users may decide to set `eager=False` in order to simply append the validation or
filtering operation to the lazy computation graph. For example, one might decide to run validation lazily:

```python
def validate_lf(lf: pl.LazyFrame) -> pl.LazyFrame:
return lf.pipe(MySchema.validate, eager=False)
```

When `eager=False`, validation only runs once the lazy frame is collected. If the input data does not satisfy the
schema, no error is raised until that point.

## Error Types

Due to current limitations in polars plugins, the type of error raised from the `validate` function (both for schemas
and collections) depends on the value of the `eager` parameter:

- When `eager=True`, a {class}`~dataframely.ValidationError` is raised from the `validate` function
- When `eager=False`, a {class}`~polars.exceptions.ComputeError` is raised from the `collect` function

```{note}
For schemas, the error _message_ itself is the same in both cases.
For collections, the error message for `eager=False` is limited and non-deterministic: it only includes information
about a single member and, if multiple members fail validation, the member it refers to may vary across executions.
```
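
The difference can be sketched as follows, reusing `MySchema` and the lazy frame `lf` from above:

```python
import dataframely as dy
import polars as pl

# eager=True: validation runs immediately and failures raise a ValidationError.
try:
    MySchema.validate(lf, eager=True)
except dy.ValidationError as exc:
    print(exc)

# eager=False: validation is only appended to the computation graph; failures
# surface as a ComputeError once the frame is collected.
try:
    MySchema.validate(lf, eager=False).collect()
except pl.exceptions.ComputeError as exc:
    print(exc)
```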
2 changes: 1 addition & 1 deletion docs/guides/index.md
@@ -8,7 +8,7 @@ quickstart
examples/index
features/index
development
versioning
migration/index
faq
```

11 changes: 10 additions & 1 deletion docs/guides/versioning.md → docs/guides/migration/index.md
@@ -1,4 +1,13 @@
# Versioning policy and breaking changes
# Migration Guides

```{toctree}
:maxdepth: 1
:hidden:

v1-v2
```

## Versioning policy and breaking changes

Dataframely uses [semantic versioning](https://semver.org/).
This versioning scheme is designed to make it easy for users to anticipate what types of change they can expect from a
161 changes: 161 additions & 0 deletions docs/guides/migration/v1-v2.md
@@ -0,0 +1,161 @@
# Migrating from v1 to v2

Dataframely v2 introduces several improvements and some breaking changes to streamline the API.

## Improvements

### Lazy Validation

Dataframely v2 finally implements lazy validation and filtering using a custom polars plugin. This allows
{meth}`Schema.validate` and {meth}`Schema.filter` to be used within lazy computation graphs instead of forcing a
`collect`. More details can be found in the [dedicated guide](../features/lazy-validation.md).

### Lazy `scan` operations

With lazy validation, all `scan_*` methods (e.g. {meth}`Schema.scan_parquet`, {meth}`Collection.scan_delta`, ...) are
now truly lazy, even if validation is necessary. Previously, these methods had to read the input into memory and run
validation eagerly.
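
For example, a minimal sketch (assuming a schema `MySchema` and a local parquet file):

```python
# v2: this stays lazy even when validation has to run; data is only read
# into memory once the frame is collected.
lf = MySchema.scan_parquet("data.parquet")
df = lf.collect()
```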

### S3 Support in all I/O functions

Dataframely v2 now properly supports S3 for all I/O functions (i.e. `write_*`, `sink_*`, `read_*`, `scan_*`).
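
As an illustration (the bucket and key are placeholders; authentication and storage options are not covered here):

```python
# Scan a parquet file directly from S3.
lf = MySchema.scan_parquet("s3://my-bucket/data.parquet")
df = lf.collect()
```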

## Breaking Changes

### Columns are non-nullable by default

In dataframely v1, specifying a column without setting the `nullable` property caused the column to be nullable and a
warning was emitted. In dataframely v2, this changes: no warning is emitted anymore and `nullable` defaults to `False`.
This mirrors the typical expectation that a column is not nullable (because `null` values often indicate issues) --
nullability now becomes opt-in.
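
A minimal sketch of the new default:

```python
import dataframely as dy


class MySchema(dy.Schema):
    a = dy.Integer()               # v2: non-nullable by default, no warning emitted
    b = dy.Integer(nullable=True)  # nullability is now an explicit opt-in
```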

### Primary key columns may not be nullable

While dataframely v1 merely emitted a warning, dataframely v2 now raises an exception if a primary key column is
designated as nullable. This aligns dataframely, for example, with SQL, where primary key columns may not be nullable.
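
For example, a definition like the following (a sketch assuming the `primary_key` column argument) only produced a
warning in v1 but is now rejected:

```python
class MySchema(dy.Schema):
    # v1: warning; v2: raises an exception because primary key columns
    # may not be nullable.
    id = dy.Integer(primary_key=True, nullable=True)
```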

### Schema rules are now defined as classmethods

In order to allow schema rules to access information about the schema and, in particular, about a schema's
subclasses, schema rules must now be defined as classmethods. This means:

```python
class MySchema(dy.Schema):
...

@dy.rule()
def my_rule() -> pl.Expr:
...
```

turns into

```python
class MySchema(dy.Schema):
...

@dy.rule()
def my_rule(cls) -> pl.Expr:
...
```

Within the schema rule, `cls` can now be used to access columns or other information from the schema. Specifically,

```python
class MySchema(dy.Schema):
a = dy.Integer()
b = dy.Integer()

@dy.rule()
def my_rule() -> pl.Expr:
return MySchema.a.col >= MySchema.b.col
```

can now be written as

```python
class MySchema(dy.Schema):
a = dy.Integer()
b = dy.Integer()

@dy.rule()
def my_rule(cls) -> pl.Expr:
return cls.a.col >= cls.b.col
```

To migrate your existing code without changing behavior, simply add the `cls` argument to the signature of your rules.
If you are using [ruff](https://docs.astral.sh/ruff/), you will need to add the following to your `pyproject.toml` for
`ruff` to recognize `@dy.rule` as a decorator that turns a method into a classmethod:

```toml
[tool.ruff.lint.pep8-naming]
classmethod-decorators = ["dataframely.rule"]
```

### Predefined checks for floats are updated

For floating point types ({class}`~dataframely.Float`, {class}`~dataframely.Float32`, {class}`~dataframely.Float64`),
the `allow_inf_nan` option has been split into `allow_inf` and `allow_nan`, allowing these to be set independently.
Note that the defaults remain the same, i.e., if `allow_inf_nan` wasn't set before, nothing changes.
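
For example (the column name is illustrative):

```python
# v1
value = dy.Float64(allow_inf_nan=False)

# v2: both checks can now be configured independently
value = dy.Float64(allow_inf=False, allow_nan=True)
```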

### Schema conversion functions are renamed

The methods that allow converting a dataframely {class}`~dataframely.Schema` into a schema of another package have been
renamed to better align with the naming scheme of conversion functions in other packages:

- `sql_schema` &rarr; `to_sqlalchemy_columns`
- `pyarrow_schema` &rarr; `to_pyarrow_schema`
- `polars_schema` &rarr; `to_polars_schema`
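
For example, a sketch assuming a schema `MySchema`:

```python
# v1
pl_schema = MySchema.polars_schema()

# v2
pl_schema = MySchema.to_polars_schema()
```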

### Methods related to primary keys are renamed

A schema's primary key may span multiple columns, yet it is still a single (composite) primary key. To align the code
with this notion, the primary key-related methods have been renamed:

- `Schema.primary_keys` &rarr; `Schema.primary_key`
- `Collection.common_primary_keys` &rarr; `Collection.common_primary_key`

### Utility functions for collection filters are renamed and safer

For writing collection filters, dataframely exposes two utility functions to express the `1:1` and `1:{1,N}`
relationships between members. These have been renamed as follows:

- `filter_relationship_one_to_one` &rarr; `require_relationship_one_to_one`
- `filter_relationship_one_to_at_least_one` &rarr; `require_relationship_one_to_at_least_one`

Additionally, their behavior changes: these functions now behave correctly even if primary key constraints are not
enforced on the schema. Previously, the validation result could duplicate input rows.

If the relationships are already enforced by primary key constraints on the schemas[^1], you can still specify
`drop_duplicates=False`. This restores the previous behavior and allows for considerable performance improvements.

[^1]: This is often the case if the filter's purpose is to remove rows that exist in one member but not the other.

### Collection metadata cannot be read from `schema.json` anymore

Prior to dataframely v1.8.0, collection metadata was serialized as a `schema.json` file when calling `write_parquet`
or `scan_parquet` on a collection. Since dataframely v1.8.0, the metadata has been moved to the individual members'
parquet metadata.

While dataframely v1 still supported reading the metadata from collections written with a version of dataframely prior
to v1.8.0, dataframely v2 removes this support. If you still have data written with a version of dataframely earlier
than v1.8.0, and, thus, still have `schema.json` files, you can migrate your data by reading it and writing it back to
disk with any version of dataframely `>=1.8.0,<2`.

### The mypy plugin is removed entirely

The mypy plugin in dataframely v1 had two purposes:

- Ensure that a method with the `@dy.rule` decorator is recognized as a rule
- Turn non-specific return types into ones with enriched type information (e.g. `dict` &rarr; `TypedDict`)

Unfortunately, the latter was error-prone: it yielded many false positives and generally made working with these
types less ergonomic, so that part of the plugin was dropped. With `@dy.rule` now being applied to classmethods, the
need for a custom mypy plugin is eliminated entirely. As a result, `dataframely.mypy` has been removed.

If you have used the mypy plugin before, you can remove the following from your `pyproject.toml`:

```toml
[tool.mypy]
plugins = ["dataframely.mypy"]
```