docs: Add remaining documentation for v2 #202
Merged
8 commits

- 6643fe3 docs: Revamp documentation with pydata theme (borchero)
- e77d685 Rename (borchero)
- cedf900 filter and rule (borchero)
- 6d4eb99 docs: Add remaining documentation for v2 (borchero)
- 0f69471 Merge branch 'main' into v1-v2 (borchero)
- b23aef5 Merge branch 'main' into v1-v2 (borchero)
- bb7d9ac Review (borchero)
- af0e488 Merge branch 'main' into v1-v2 (borchero)

| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,13 @@ | ||
| :html_theme.sidebar_secondary.remove: true | ||
|
|
||
| .. role:: hidden | ||
|
|
||
| {{ name | underline }} | ||
|
|
||
| .. currentmodule:: {{ module }} | ||
|
|
||
| .. autoclass:: {{ name }} | ||
| :members: | ||
| :exclude-members: add_note, with_traceback | ||
| :autosummary: | ||
| :autosummary-nosignatures: |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,15 @@ | ||
| ====== | ||
| Errors | ||
| ====== | ||
|
|
||
| .. currentmodule:: dataframely | ||
| .. autosummary:: | ||
| :toctree: _gen/ | ||
| :template: classes/error.rst | ||
| :nosignatures: | ||
|
|
||
| ~exc.SchemaError | ||
| ~exc.ValidationError | ||
| ~exc.ImplementationError | ||
| ~exc.AnnotationImplementationError | ||
| ~exc.ValidationRequiredError |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -8,4 +8,5 @@ data-generation | |
| primary-keys | ||
| serialization | ||
| sql-generation | ||
| lazy-validation | ||
| ``` | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,41 @@ | ||
| # Lazy Validation | ||
|
|
||
| In many cases, dataframely's capability to validate and filter input data is used at core application boundaries. | ||
| As a result, `validate` and `filter` are generally expected to be used at points where `collect` is called on a lazy | ||
| frame. However, there may be situations where validation or filtering should simply be added to the lazy computation | ||
| graph. Starting in dataframely v2, this is supported via a custom polars plugin. | ||
|
|
||
| ## The `eager` parameter | ||
|
|
||
| All of the following methods expose an `eager: bool` parameter: | ||
|
|
||
| - {meth}`Schema.validate() <dataframely.Schema.validate>` | ||
| - {meth}`Schema.filter() <dataframely.Schema.filter>` | ||
| - {meth}`Collection.validate() <dataframely.Collection.validate>` | ||
| - {meth}`Collection.filter() <dataframely.Collection.filter>` | ||
|
|
||
| By default, `eager=True`. However, users may decide to set `eager=False` in order to simply append the validation or | ||
| the filtering operation to the lazy frame. For example, one might decide to run validation lazily: | ||
|
|
||
| ```python | ||
| def validate_lf(lf: pl.LazyFrame) -> pl.LazyFrame: | ||
|     return lf.pipe(MySchema.validate, eager=False) | ||
| ``` | ||
|
|
||
| When `eager=False`, validation only runs once the lazy frame is collected. If the input data does not satisfy the | ||
| schema, no error is raised at this point; the error only surfaces when `collect` is called. | ||
|
|
||
| ## Error Types | ||
|
|
||
| Due to current limitations in polars plugins, the type of error raised when validation fails (both for schemas and | ||
| collections) depends on the value of the `eager` parameter: | ||
|
|
||
| - When `eager=True`, a {class}`~dataframely.ValidationError` is raised from the `validate` function | ||
| - When `eager=False`, a {class}`~polars.exceptions.ComputeError` is raised from the `collect` function | ||
|
|
||
| ```{note} | ||
| For schemas, the error _message_ itself is equivalent. | ||
| For collections, the error message for `eager=False` is limited and non-deterministic: the error message only includes | ||
| information about a single member and, if multiple members fail validation, the member that the error message refers to | ||
| may vary across executions. | ||
| ``` |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -8,7 +8,7 @@ quickstart | |
| examples/index | ||
| features/index | ||
| development | ||
| versioning | ||
| migration/index | ||
| faq | ||
| ``` | ||
|
|
||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,161 @@ | ||
| # Migrating from v1 to v2 | ||
|
|
||
| Dataframely v2 introduces several improvements and some breaking changes to streamline the API. | ||
|
|
||
| ## Improvements | ||
|
|
||
| ### Lazy Validation | ||
|
|
||
| Dataframely v2 finally implements lazy validation and filtering using a custom polars plugin. This allows | ||
| {meth}`Schema.validate` and {meth}`Schema.filter` to be used within lazy computation graphs instead of forcing a | ||
| `collect`. More details can be found in the [dedicated guide](../features/lazy-validation.md). | ||
|
|
||
| ### Lazy `scan` operations | ||
|
|
||
| With lazy validation, all `scan_*` methods (e.g. {meth}`Schema.scan_parquet`, {meth}`Collection.scan_delta`, ...) are | ||
| now truly lazy, even if validation is necessary. Previously, this required collecting the input and running validation | ||
| eagerly. | ||
|
|
||
| ### S3 Support in all I/O functions | ||
|
|
||
| Dataframely v2 now properly supports S3 for all I/O functions (i.e. `write_*`, `sink_*`, `read_*`, `scan_*`). | ||
|
|
||
| ## Breaking Changes | ||
|
|
||
| ### Columns are non-nullable by default | ||
|
|
||
| In dataframely v1, specifying a column without setting the `nullable` property caused the column to be nullable and a | ||
| warning was emitted. In dataframely v2, this changes: no warning is emitted anymore and `nullable` defaults to `False`. | ||
| This mirrors the typical expectation that a column is not nullable (because `null` values often indicate issues) -- | ||
| nullability now becomes opt-in. | ||
|
|
||
| ### Primary key columns may not be nullable | ||
|
|
||
| While dataframely v1 merely emitted a warning, dataframely v2 now raises an exception if a primary key column is | ||
| designated as nullable. This aligns dataframely, for example, with SQL where primary key columns may not be nullable. | ||
|
|
||
| ### Schema rules are now defined as classmethods | ||
|
|
||
| In order to allow schema rules to access information about the schema and, especially, information of a schema's | ||
| subclasses, schema rules must now be specified as classmethods. This means: | ||
|
|
||
| ```python | ||
| class MySchema(dy.Schema): | ||
|     ... | ||
|
|
||
|     @dy.rule() | ||
|     def my_rule() -> pl.Expr: | ||
|         ... | ||
| ``` | ||
|
|
||
| turns into | ||
|
|
||
| ```python | ||
| class MySchema(dy.Schema): | ||
|     ... | ||
|
|
||
|     @dy.rule() | ||
|     def my_rule(cls) -> pl.Expr: | ||
|         ... | ||
| ``` | ||
|
|
||
| Within the schema rule, `cls` can now be used to access columns or other information from the schema. Specifically, | ||
|
|
||
| ```python | ||
| class MySchema(dy.Schema): | ||
|     a = dy.Integer() | ||
|     b = dy.Integer() | ||
|
|
||
|     @dy.rule() | ||
|     def my_rule() -> pl.Expr: | ||
|         return MySchema.a.col >= MySchema.b.col | ||
| ``` | ||
|
|
||
| can now be written as | ||
|
|
||
| ```python | ||
| class MySchema(dy.Schema): | ||
|     a = dy.Integer() | ||
|     b = dy.Integer() | ||
|
|
||
|     @dy.rule() | ||
|     def my_rule(cls) -> pl.Expr: | ||
|         return cls.a.col >= cls.b.col | ||
| ``` | ||
|
|
||
| To migrate your existing code without changing behavior, simply add the `cls` argument to the signature of your rules. | ||
| If you are using [ruff](https://docs.astral.sh/ruff/), you will need to add the following to your `pyproject.toml` for | ||
| `ruff` to recognize `@dy.rule` as a decorator that turns a method into a classmethod: | ||
|
|
||
| ```toml | ||
| [tool.ruff.lint.pep8-naming] | ||
| classmethod-decorators = ["dataframely.rule"] | ||
|
||
| ``` | ||
|
|
||
| ### Predefined checks for floats are updated | ||
|
|
||
| For floating point types ({class}`~dataframely.Float`, {class}`~dataframely.Float32`, {class}`~dataframely.Float64`), | ||
| the `allow_inf_nan` option has been split into `allow_inf` and `allow_nan`, allowing these to be set | ||
| independently. Note that the defaults remain the same, i.e., if `allow_inf_nan` wasn't set before, nothing changes. | ||
|
|
||
| ### Schema conversion functions are renamed | ||
|
|
||
| The methods that allow converting a dataframely {class}`~dataframely.Schema` into a schema of another package have been | ||
| renamed to better align with the naming scheme of conversion functions in other packages: | ||
|
|
||
| - `sql_schema` → `to_sqlalchemy_columns` | ||
| - `pyarrow_schema` → `to_pyarrow_schema` | ||
| - `polars_schema` → `to_polars_schema` | ||
|
|
||
| ### Methods related to primary keys are renamed | ||
|
|
||
| The primary key of a schema may span multiple columns, yet it is still a single (composite) primary key. To align | ||
| the code with this notion, the primary key-related methods are renamed: | ||
|
|
||
| - `Schema.primary_keys` → `Schema.primary_key` | ||
| - `Collection.common_primary_keys` → `Collection.common_primary_key` | ||
|
|
||
| ### Utility functions for collection filters are renamed and safer | ||
|
|
||
| For writing collection filters, dataframely exposes two utility functions to express the `1:1` and `1:{1,N}` | ||
| relationships between members. These have been renamed as follows: | ||
|
|
||
| - `filter_relationship_one_to_one` → `require_relationship_one_to_one` | ||
| - `filter_relationship_one_to_at_least_one` → `require_relationship_one_to_at_least_one` | ||
|
|
||
| Additionally, their behavior changes: the functions now behave correctly even if primary key constraints are not | ||
| enforced on the schemas. Previously, input rows could be duplicated in the validation result. | ||
|
|
||
| If the relationships are already enforced by primary key constraints on the schemas[^1], you can still specify | ||
| `drop_duplicates=False`. This returns to the previous behavior and allows for considerable performance improvements. | ||
|
|
||
| [^1]: This is often the case if the filter's purpose is to remove rows that exist in one member but not the other. | ||
|
|
||
| ### Collection metadata cannot be read from `schema.json` anymore | ||
|
|
||
| Prior to dataframely v1.8.0, collection metadata was serialized as a `schema.json` file when calling | ||
| `write_parquet` or `sink_parquet` on a collection. Since dataframely v1.8.0, the metadata is stored in the | ||
| individual members' parquet metadata. | ||
|
|
||
| While dataframely v1 still supported reading the metadata from collections written with a version of dataframely prior | ||
| to v1.8.0, dataframely v2 removes this support. If you still have data written with a version of dataframely earlier | ||
| than v1.8.0, and, thus, still have `schema.json` files, you can migrate your data by reading it and writing it back to | ||
| disk with any version of dataframely `>=1.8.0,<2`. | ||
|
|
||
| ### The mypy plugin is removed entirely | ||
|
|
||
| The mypy plugin in dataframely v1 had two purposes: | ||
|
|
||
| - Ensure that a method with `@dy.rule` decorator is recognized as a rule | ||
| - Turn non-specific return types into ones with enriched type information (e.g. `dict` → `TypedDict`) | ||
|
|
||
| Unfortunately, the latter was error-prone as it yielded many false positives and generally made working with these | ||
| types less ergonomic. We therefore actively removed this part. With `@dy.rule` being applied to classmethods, the need | ||
| for a custom mypy plugin is eliminated entirely. As a result, `dataframely.mypy` has been removed. | ||
|
|
||
| If you have used the mypy plugin before, you can remove the following from your `pyproject.toml`: | ||
|
|
||
| ```toml | ||
| [tool.mypy] | ||
| plugins = ["dataframely.mypy"] | ||
| ``` | ||
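
The renames listed in the migration guide above lend themselves to a quick scan of an existing code base. The following is a minimal sketch, not part of dataframely: the `RENAMES` table is copied from the guide, while `find_v1_names` is a hypothetical helper name introduced here for illustration. It only reports candidate lines by identifier; since names such as `primary_keys` may also occur in unrelated code, each hit still needs a manual check.

```python
import re

# v1 name -> v2 name, copied from the migration guide above.
RENAMES = {
    "sql_schema": "to_sqlalchemy_columns",
    "pyarrow_schema": "to_pyarrow_schema",
    "polars_schema": "to_polars_schema",
    "primary_keys": "primary_key",
    "common_primary_keys": "common_primary_key",
    "filter_relationship_one_to_one": "require_relationship_one_to_one",
    "filter_relationship_one_to_at_least_one": (
        "require_relationship_one_to_at_least_one"
    ),
}

# \b ensures that only whole identifiers match, so the v2 name
# `to_polars_schema` is not reported as a use of `polars_schema`.
_PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, RENAMES)) + r")\b")


def find_v1_names(source: str) -> list[tuple[int, str, str]]:
    """Return (line number, v1 name, v2 name) for each candidate hit.

    Hypothetical helper for illustration; it matches plain text and does not
    check that the identifier actually belongs to a dataframely object.
    """
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for match in _PATTERN.finditer(line):
            name = match.group(0)
            hits.append((lineno, name, RENAMES[name]))
    return hits


if __name__ == "__main__":
    example = "schema = MySchema.polars_schema()\nkey = MySchema.primary_keys()\n"
    for lineno, old, new in find_v1_names(example):
        print(f"line {lineno}: {old} -> {new}")
```

A text-based scan was chosen over an automatic rewrite because the guide also describes behavior changes (for example for the relationship helpers), which a pure rename would hide.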