8 changes: 0 additions & 8 deletions dataframely/collection/collection.py
@@ -956,11 +956,6 @@ def scan_parquet(
ValueError: If the provided directory does not contain parquet files for
all required members.

Note:
Due to current limitations in dataframely, this method actually reads the
parquet file into memory if `"validation"` is `"warn"` or `"allow"`
and validation is required.

Attention:
Be aware that this method suffers from the same limitations as
:meth:`serialize`.
@@ -1049,9 +1044,6 @@ def scan_delta(
ValueError:
If the provided source does not contain Delta tables for all required members.

Note:
Due to current limitations in dataframely, this method may read the Delta table into memory if `validation` is `"warn"` or `"allow"` and validation is required.

Attention:
Schema metadata is stored as custom commit metadata. Only the schema
information from the last commit is used, so any table modifications
13 changes: 13 additions & 0 deletions docs/_templates/classes/error.rst
@@ -0,0 +1,13 @@
:html_theme.sidebar_secondary.remove: true

.. role:: hidden

{{ name | underline }}

.. currentmodule:: {{ module }}

.. autoclass:: {{ name }}
:members:
:exclude-members: add_note, with_traceback
:autosummary:
:autosummary-nosignatures:
15 changes: 15 additions & 0 deletions docs/api/errors/index.rst
@@ -0,0 +1,15 @@
======
Errors
======

.. currentmodule:: dataframely
.. autosummary::
:toctree: _gen/
:template: classes/error.rst
:nosignatures:

~exc.SchemaError
~exc.ValidationError
~exc.ImplementationError
~exc.AnnotationImplementationError
~exc.ValidationRequiredError
9 changes: 9 additions & 0 deletions docs/api/index.rst
@@ -25,6 +25,15 @@ API Reference

columns/index

.. grid::

.. grid-item-card::

.. toctree::
:maxdepth: 1

errors/index

.. grid-item-card::

.. toctree::
4 changes: 3 additions & 1 deletion docs/conf.py
@@ -174,7 +174,9 @@ def hide_class_signature(
    return_annotation: str,
) -> tuple[str, str] | None:
    if what == "class" and (
        name.endswith("FilterResult") or name.endswith("FailureInfo")
        name.endswith("FilterResult")
        or name.endswith("FailureInfo")
        or name.endswith("AnnotationImplementationError")
    ):
        # Return empty signature (no args after the class name)
        return "", return_annotation
17 changes: 17 additions & 0 deletions docs/guides/faq.md
@@ -29,3 +29,20 @@ class UserSchema(dy.Schema):
        """Email must be unique, if provided."""
        return pl.col("email").is_null() | pl.col("email").is_unique()
```

## How do I fix the ruff error `First argument of a method should be named self`?

If you are using [`ruff`](https://docs.astral.sh/ruff/) and introduce custom rules for your schemas, `ruff` will report
the following linting error:

```
N805 First argument of a method should be named `self`
```
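
For illustration, a rule defined like the following is what triggers the warning (a minimal sketch; the schema and rule
names are hypothetical):

```python
import dataframely as dy
import polars as pl


class UserSchema(dy.Schema):
    email = dy.String(nullable=True)

    @dy.rule()
    def email_unique_if_provided(cls) -> pl.Expr:  # ruff flags `cls` with N805 here
        return pl.col("email").is_null() | pl.col("email").is_unique()
```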

To fix this, you'll need to let `ruff` know that the `@dy.rule` decorator turns the decorated method into a
classmethod. This can be done by adding the following to your `pyproject.toml`:

```toml
[tool.ruff.lint.pep8-naming]
classmethod-decorators = ["dataframely.rule"]
```
1 change: 1 addition & 0 deletions docs/guides/features/index.md
@@ -8,4 +8,5 @@ data-generation
primary-keys
serialization
sql-generation
lazy-validation
```
41 changes: 41 additions & 0 deletions docs/guides/features/lazy-validation.md
@@ -0,0 +1,41 @@
# Lazy Validation

In many cases, dataframely's capability to validate and filter input data is used at core application boundaries.
As a result, `validate` and `filter` are generally expected to be used at points where `collect` is called on a lazy
frame. However, there may be situations where validation or filtering should simply be added to the lazy computation
graph. Starting in dataframely v2, this is supported via a custom polars plugin.

## The `eager` parameter

All of the following methods expose an `eager: bool` parameter:

- {meth}`Schema.validate() <dataframely.Schema.validate>`
- {meth}`Schema.filter() <dataframely.Schema.filter>`
- {meth}`Collection.validate() <dataframely.Collection.validate>`
- {meth}`Collection.filter() <dataframely.Collection.filter>`

By default, `eager=True`. However, users may decide to set `eager=False` in order to simply append the validation or
filtering operation to the lazy computation graph. For example, one might decide to run validation lazily:

```python
def validate_lf(lf: pl.LazyFrame) -> pl.LazyFrame:
return lf.pipe(MySchema.validate, eager=False)
```

When `eager=False`, validation only runs once the lazy frame is collected. If the input data does not satisfy the
schema, no error is raised until that point.

## Error Types

Due to current limitations in polars plugins, the type of error raised from the `validate` function (both for schemas
and collections) depends on the value of the `eager` parameter:

- When `eager=True`, a {class}`~dataframely.ValidationError` is raised from the `validate` function
- When `eager=False`, a {class}`~polars.exceptions.ComputeError` is raised from the `collect` function

```{note}
For schemas, the error _message_ itself is the same in both cases.
For collections, the error message for `eager=False` is limited and non-deterministic: it only includes information
about a single member and, if multiple members fail validation, the member it refers to may vary across executions.
```
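
The difference can be sketched as follows, reusing `MySchema` and the lazy frame `lf` from above:

```python
import dataframely as dy
import polars as pl

# eager=True: validation runs immediately and failures raise a ValidationError.
try:
    MySchema.validate(lf, eager=True)
except dy.ValidationError as exc:
    print(exc)

# eager=False: validation is only appended to the computation graph; failures
# surface as a ComputeError once the frame is collected.
try:
    MySchema.validate(lf, eager=False).collect()
except pl.exceptions.ComputeError as exc:
    print(exc)
```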
2 changes: 1 addition & 1 deletion docs/guides/index.md
@@ -8,7 +8,7 @@ quickstart
examples/index
features/index
development
versioning
migration/index
faq
```

11 changes: 10 additions & 1 deletion docs/guides/versioning.md → docs/guides/migration/index.md
@@ -1,4 +1,13 @@
# Versioning policy and breaking changes
# Migration Guides

```{toctree}
:maxdepth: 1
:hidden:

v1-v2
```

## Versioning policy and breaking changes

Dataframely uses [semantic versioning](https://semver.org/).
This versioning scheme is designed to make it easy for users to anticipate what types of change they can expect from a
161 changes: 161 additions & 0 deletions docs/guides/migration/v1-v2.md
@@ -0,0 +1,161 @@
# Migrating from v1 to v2

Dataframely v2 introduces several improvements and some breaking changes to streamline the API.

## Improvements

### Lazy Validation

Dataframely v2 finally implements lazy validation and filtering using a custom polars plugin. This allows
{meth}`Schema.validate` and {meth}`Schema.filter` to be used within lazy computation graphs instead of forcing a
`collect`. More details can be found in the [dedicated guide](../features/lazy-validation.md).

### Lazy `scan` operations

With lazy validation, all `scan_*` methods (e.g. {meth}`Schema.scan_parquet`, {meth}`Collection.scan_delta`, ...) are
now truly lazy, even if validation is necessary. Previously, these methods had to read the input into memory and run
validation eagerly.
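
For example, a minimal sketch (assuming a schema `MySchema` and a local parquet file):

```python
# v2: this stays lazy even when validation has to run; data is only read
# into memory once the frame is collected.
lf = MySchema.scan_parquet("data.parquet")
df = lf.collect()
```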

### S3 Support in all I/O functions

Dataframely v2 now properly supports S3 for all I/O functions (i.e. `write_*`, `sink_*`, `read_*`, `scan_*`).
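
As an illustration (the bucket and key are placeholders; authentication and storage options are not covered here):

```python
# Scan a parquet file directly from S3.
lf = MySchema.scan_parquet("s3://my-bucket/data.parquet")
df = lf.collect()
```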

## Breaking Changes

### Columns are non-nullable by default

In dataframely v1, specifying a column without setting the `nullable` property caused the column to be nullable and a
warning was emitted. In dataframely v2, this changes: no warning is emitted anymore and `nullable` defaults to `False`.
This mirrors the typical expectation that a column is not nullable (because `null` values often indicate issues) --
nullability now becomes opt-in.
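
A minimal sketch of the new default:

```python
import dataframely as dy


class MySchema(dy.Schema):
    a = dy.Integer()               # v2: non-nullable by default, no warning emitted
    b = dy.Integer(nullable=True)  # nullability is now an explicit opt-in
```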

### Primary key columns may not be nullable

While dataframely v1 merely emitted a warning, dataframely v2 now raises an exception if a primary key column is
designated as nullable. This aligns dataframely, for example, with SQL, where primary key columns may not be nullable.
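
For example, a definition like the following (a sketch assuming the `primary_key` column argument) only produced a
warning in v1 but is now rejected:

```python
class MySchema(dy.Schema):
    # v1: warning; v2: raises an exception because primary key columns
    # may not be nullable.
    id = dy.Integer(primary_key=True, nullable=True)
```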

### Schema rules are now defined as classmethods

In order to allow schema rules to access information about the schema and, in particular, about a schema's
subclasses, schema rules must now be defined as classmethods. This means:

```python
class MySchema(dy.Schema):
...

@dy.rule()
def my_rule() -> pl.Expr:
...
```

turns into

```python
class MySchema(dy.Schema):
...

@dy.rule()
def my_rule(cls) -> pl.Expr:
...
```

Within the schema rule, `cls` can now be used to access columns or other information from the schema. Specifically,

```python
class MySchema(dy.Schema):
a = dy.Integer()
b = dy.Integer()

@dy.rule()
def my_rule() -> pl.Expr:
return MySchema.a.col >= MySchema.b.col
```

can now be written as

```python
class MySchema(dy.Schema):
a = dy.Integer()
b = dy.Integer()

@dy.rule()
def my_rule(cls) -> pl.Expr:
return cls.a.col >= cls.b.col
```

To migrate your existing code without changing behavior, simply add the `cls` argument to the signature of your rules.
If you are using [ruff](https://docs.astral.sh/ruff/), you will need to add the following to your `pyproject.toml` for
`ruff` to recognize `@dy.rule` as a decorator that turns a method into a classmethod:

```toml
[tool.ruff.lint.pep8-naming]
classmethod-decorators = ["dataframely.rule"]
```

### Predefined checks for floats are updated

For floating point types ({class}`~dataframely.Float`, {class}`~dataframely.Float32`, {class}`~dataframely.Float64`),
the `allow_inf_nan` option has been split into `allow_inf` and `allow_nan`, allowing these to be set independently.
Note that the defaults remain the same, i.e., if `allow_inf_nan` wasn't set before, nothing changes.
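
For example (the column name is illustrative):

```python
# v1
value = dy.Float64(allow_inf_nan=False)

# v2: both checks can now be configured independently
value = dy.Float64(allow_inf=False, allow_nan=True)
```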

### Schema conversion functions are renamed

The methods that allow converting a dataframely {class}`~dataframely.Schema` into a schema of another package have been
renamed to better align with the naming scheme of conversion functions in other packages:

- `sql_schema` &rarr; `to_sqlalchemy_columns`
- `pyarrow_schema` &rarr; `to_pyarrow_schema`
- `polars_schema` &rarr; `to_polars_schema`
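
For example, a sketch assuming a schema `MySchema`:

```python
# v1
pl_schema = MySchema.polars_schema()

# v2
pl_schema = MySchema.to_polars_schema()
```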

### Methods related to primary keys are renamed

A schema's primary key may span multiple columns, yet it is still a single (composite) primary key. To align the code
with this notion, the primary key-related methods have been renamed:

- `Schema.primary_keys` &rarr; `Schema.primary_key`
- `Collection.common_primary_keys` &rarr; `Collection.common_primary_key`

### Utility functions for collection filters are renamed and safer

For writing collection filters, dataframely exposes two utility functions to express the `1:1` and `1:{1,N}`
relationships between members. These have been renamed as follows:

- `filter_relationship_one_to_one` &rarr; `require_relationship_one_to_one`
- `filter_relationship_one_to_at_least_one` &rarr; `require_relationship_one_to_at_least_one`

Additionally, their behavior changes: these functions now behave correctly even if primary key constraints are not
enforced on the schema. Previously, the validation result could duplicate input rows.

If the relationships are already enforced by primary key constraints on the schemas[^1], you can still specify
`drop_duplicates=False`. This restores the previous behavior and allows for considerable performance improvements.

[^1]: This is often the case if the filter's purpose is to remove rows that exist in one member but not the other.

### Collection metadata cannot be read from `schema.json` anymore

Prior to dataframely v1.8.0, collection metadata was serialized as a `schema.json` file when calling `write_parquet`
or `scan_parquet` on a collection. Since dataframely v1.8.0, the metadata has been moved to the individual members'
parquet metadata.

While dataframely v1 still supported reading the metadata from collections written with a version of dataframely prior
to v1.8.0, dataframely v2 removes this support. If you still have data written with a version of dataframely earlier
than v1.8.0, and, thus, still have `schema.json` files, you can migrate your data by reading it and writing it back to
disk with any version of dataframely `>=1.8.0,<2`.

### The mypy plugin is removed entirely

The mypy plugin in dataframely v1 had two purposes:

- Ensure that a method with the `@dy.rule` decorator is recognized as a rule
- Turn non-specific return types into ones with enriched type information (e.g. `dict` &rarr; `TypedDict`)

Unfortunately, the latter was error-prone: it yielded many false positives and generally made working with these
types less ergonomic, so that part of the plugin was dropped. With `@dy.rule` now being applied to classmethods, the
need for a custom mypy plugin is eliminated entirely. As a result, `dataframely.mypy` has been removed.

If you have used the mypy plugin before, you can remove the following from your `pyproject.toml`:

```toml
[tool.mypy]
plugins = ["dataframely.mypy"]
```