34 changes: 17 additions & 17 deletions dataframely/collection.py
@@ -63,7 +63,7 @@ class Collection(BaseCollection, ABC):
represent "semantic objects" which cannot be represented in a single data frame due
to 1-N relationships that are managed in separate data frames.

A collection must only have type annotations for :class:`~dataframely.LazyFrame`s
A collection must only have type annotations for :class:`~dataframely.LazyFrame`
with known schema:

.. code:: python
@@ -786,20 +786,20 @@ def read_parquet(
Parquet files may have been written with Hive partitioning.
validation: The strategy for running validation when reading the data:

- ``"allow"`: The method tries to read the schema data from the parquet
- ``"allow"``: The method tries to read the schema data from the parquet
files. If the stored collection schema matches this collection
schema, the collection is read without validation. If the stored
schema mismatches this schema, no metadata can be found in
the parquets, or the files have conflicting metadata,
this method automatically runs :meth:`validate` with ``cast=True``.
- ``"warn"`: The method behaves similarly to ``"allow"``. However,
- ``"warn"``: The method behaves similarly to ``"allow"``. However,
it prints a warning if validation is necessary.
- ``"forbid"``: The method never runs validation automatically and only
returns if the metadata stores a collection schema that matches
this collection.
- ``"skip"``: The method never runs validation and simply reads the
data, entrusting the user that the schema is valid. _Use this option
carefully_.
data, entrusting the user that the schema is valid. *Use this option
carefully*.

kwargs: Additional keyword arguments passed directly to
:meth:`polars.read_parquet`.
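
As a hedged sketch of how the validation strategies above might be used (the schema, collection, and directory names are hypothetical and not taken from this change):

```python
import dataframely as dy


# Hypothetical schema and collection, for illustration only.
class UserSchema(dy.Schema):
    user_id = dy.UInt64(primary_key=True, nullable=False)
    username = dy.String(nullable=False)


class UserCollection(dy.Collection):
    users: dy.LazyFrame[UserSchema]


# "allow": read without validation if the stored schema matches this
# collection, otherwise fall back to validate(cast=True) automatically.
collection = UserCollection.read_parquet("data/users/", validation="allow")

# "forbid": never validate automatically; only succeeds if the stored
# metadata matches this collection.
collection = UserCollection.read_parquet("data/users/", validation="forbid")
```

With `validation="warn"` the behaviour matches `"allow"` but emits a warning whenever the fallback validation runs, which can be a useful middle ground for pipelines that tolerate re-validation but want visibility into it.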
@@ -849,20 +849,20 @@ def scan_parquet(
Parquet files may have been written with Hive partitioning.
validation: The strategy for running validation when reading the data:

- ``"allow"`: The method tries to read the schema data from the parquet
- ``"allow"``: The method tries to read the schema data from the parquet
files. If the stored collection schema matches this collection
schema, the collection is read without validation. If the stored
schema mismatches this schema, no metadata can be found in
the parquets, or the files have conflicting metadata,
this method automatically runs :meth:`validate` with ``cast=True``.
- ``"warn"`: The method behaves similarly to ``"allow"``. However,
- ``"warn"``: The method behaves similarly to ``"allow"``. However,
it prints a warning if validation is necessary.
- ``"forbid"``: The method never runs validation automatically and only
returns if the metadata stores a collection schema that matches
this collection.
- ``"skip"``: The method never runs validation and simply reads the
data, entrusting the user that the schema is valid. _Use this option
carefully_.
data, entrusting the user that the schema is valid. *Use this option
carefully*.

kwargs: Additional keyword arguments passed directly to
:meth:`polars.scan_parquet` for all members.
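
The lazy counterpart follows the same pattern; a minimal sketch, again with a hypothetical collection and assuming that annotated members are exposed as lazy frames on the returned instance:

```python
import dataframely as dy


class UserSchema(dy.Schema):
    user_id = dy.UInt64(primary_key=True, nullable=False)


class UserCollection(dy.Collection):
    users: dy.LazyFrame[UserSchema]


# "skip" trusts the stored data entirely: nothing is validated and members are
# only scanned (mirroring polars.scan_parquet), so this is cheap but should be
# reserved for data written by a trusted process.
collection = UserCollection.scan_parquet("data/users/", validation="skip")

# Assumption: annotated members are accessible as attributes and stay lazy
# until collected.
users = collection.users.collect()
```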
@@ -947,20 +947,20 @@ def scan_delta(
source: The location or DeltaTable to read from.
validation: The strategy for running validation when reading the data:

- ``"allow"`: The method tries to read the schema data from the parquet
- ``"allow"``: The method tries to read the schema data from the parquet
files. If the stored collection schema matches this collection
schema, the collection is read without validation. If the stored
schema mismatches this schema, no metadata can be found in
the parquets, or the files have conflicting metadata,
this method automatically runs :meth:`validate` with ``cast=True``.
- ``"warn"`: The method behaves similarly to ``"allow"``. However,
- ``"warn"``: The method behaves similarly to ``"allow"``. However,
it prints a warning if validation is necessary.
- ``"forbid"``: The method never runs validation automatically and only
returns if the metadata stores a collection schema that matches
this collection.
- ``"skip"``: The method never runs validation and simply reads the
data, entrusting the user that the schema is valid. _Use this option
carefully_.
data, entrusting the user that the schema is valid. *Use this option
carefully*.

kwargs: Additional keyword arguments passed directly to :meth:`polars.scan_delta`.

@@ -1010,20 +1010,20 @@ def read_delta(
source: The location or DeltaTable to read from.
validation: The strategy for running validation when reading the data:

- ``"allow"`: The method tries to read the schema data from the parquet
- ``"allow"``: The method tries to read the schema data from the parquet
files. If the stored collection schema matches this collection
schema, the collection is read without validation. If the stored
schema mismatches this schema, no metadata can be found in
the parquets, or the files have conflicting metadata,
this method automatically runs :meth:`validate` with ``cast=True``.
- ``"warn"`: The method behaves similarly to ``"allow"``. However,
- ``"warn"``: The method behaves similarly to ``"allow"``. However,
it prints a warning if validation is necessary.
- ``"forbid"``: The method never runs validation automatically and only
returns if the metadata stores a collection schema that matches
this collection.
- ``"skip"``: The method never runs validation and simply reads the
data, entrusting the user that the schema is valid. _Use this option
carefully_.
data, entrusting the user that the schema is valid. *Use this option
carefully*.

kwargs: Additional keyword arguments passed directly to :meth:`polars.read_delta`.
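
Both Delta Lake readers above accept the same `validation` argument; a hedged sketch with a made-up table location and collection:

```python
import dataframely as dy


class UserSchema(dy.Schema):
    user_id = dy.UInt64(primary_key=True, nullable=False)


class UserCollection(dy.Collection):
    users: dy.LazyFrame[UserSchema]


# Lazy scan of a Delta table, warning if the fallback validation has to run.
lazy_collection = UserCollection.scan_delta("path/to/users_delta", validation="warn")

# Eager read of the same table, refusing to fall back to validation.
eager_collection = UserCollection.read_delta("path/to/users_delta", validation="forbid")
```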

7 changes: 7 additions & 0 deletions docs/conf.py
@@ -43,8 +43,15 @@
"sphinx.ext.autodoc",
"sphinx.ext.linkcode",
"sphinxcontrib.apidoc",
"myst_parser",
]

myst_parser_config = {"myst_enable_extensions": ["rst_eval_roles"]}
source_suffix = {
".rst": "restructuredtext",
".txt": "markdown",
".md": "markdown",
}
numpydoc_class_members_toctree = False

apidoc_module_dir = "../dataframely"
50 changes: 26 additions & 24 deletions docs/index.rst → docs/index.md
@@ -1,41 +1,41 @@
Dataframely
============
# Dataframely

Dataframely is a Python package to validate the schema and content of `polars <https://pola.rs/>`_ data frames.
Its purpose is to make data pipelines more robust by ensuring that data meet expectations and more readable by adding schema information to data frame type hints.
Dataframely is a Python package to validate the schema and content of [polars](https://pola.rs/) data frames.
Its purpose is to make data pipelines more robust by ensuring that data meet expectations and more readable by adding
schema information to data frame type hints.

Features
--------
## Features

- Declaratively define schemas as classes with arbitrary inheritance structure
- Specify column-specific validation rules (e.g. nullability, minimum string length, ...)
- Specify cross-column and group validation rules with built-in support for checking the primary key property of a column set
- Specify cross-column and group validation rules with built-in support for checking the primary key property of a
column set
- Specify validation constraints across collections of interdependent data frames
- Validate data frames softly by simply filtering out rows violating rules instead of failing hard
- Introspect validation failure information for run-time failures
- Enhanced type hints for validated data frames allowing users to clearly express expectations about inputs and outputs (i.e., contracts) in data pipelines
- Integrate schemas with external tools (e.g., ``sqlalchemy`` or ``pyarrow``)
- Enhanced type hints for validated data frames allowing users to clearly express expectations about inputs and
outputs (i.e., contracts) in data pipelines (see the sketch after this list)
- Integrate schemas with external tools (e.g., `sqlalchemy` or `pyarrow`)
- Generate test data that comply with a schema or collection of schemas and its validation rules
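
As a small illustration of the type-hint feature above (the function and column names are made up for this sketch):

```python
import dataframely as dy
import polars as pl


class UserSchema(dy.Schema):
    user_id = dy.UInt64(primary_key=True, nullable=False)
    username = dy.String(nullable=False)


def distinct_usernames(users: dy.LazyFrame[UserSchema]) -> pl.LazyFrame:
    # The dy.LazyFrame[UserSchema] annotation documents the contract of this
    # step: callers are expected to pass validated data, and readers can see
    # at a glance which columns (and nullability guarantees) it relies on.
    return users.select(pl.col("username").unique())
```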

Contents
========
## Contents

.. toctree::
:caption: Contents
:maxdepth: 2
```{toctree}
:caption: Contents
:maxdepth: 2

Installation <sites/installation.rst>
Quickstart <sites/quickstart.rst>
Real-world Example <sites/examples/real-world.ipynb>
Features <sites/features/index.rst>
FAQ <sites/faq.rst>
Development Guide <sites/development.rst>
Versioning <sites/versioning.rst>
sites/installation
sites/quickstart
sites/examples/real-world
sites/features/index.md
sites/faq.md
sites/development.md
sites/versioning.md
```

API Documentation
=================
## API Documentation

.. toctree::
```{toctree}
:caption: API Documentation
:maxdepth: 1

@@ -45,3 +45,5 @@ API Documentation
Random Data Generation <_api/dataframely.random>
Failure Information <_api/dataframely.failure>
Schema <_api/dataframely.schema>

```
46 changes: 46 additions & 0 deletions docs/sites/development.md
@@ -0,0 +1,46 @@
# Development

Thanks for deciding to work on `dataframely`!
You can create a development environment with the following steps:

## Environment Installation

```bash
git clone https://github.com/Quantco/dataframely
cd dataframely
pixi install
```

Next make sure to install the package locally and set up pre-commit hooks:

```bash
pixi run postinstall
pixi run pre-commit-install
```

## Running the tests

```bash
pixi run test
```

You can adjust the `tests/` path to run tests in a specific directory or module.

## Documentation

We use [Sphinx](https://www.sphinx-doc.org/en/master/index.html) together
with [MyST](https://myst-parser.readthedocs.io/), and write user documentation in markdown.
If you are not yet familiar with this setup,
the [MyST docs for Sphinx](https://myst-parser.readthedocs.io/en/v0.17.2/sphinx/intro.html) are a good starting point.

When updating the documentation, you can compile a local build of the
documentation and then open it in your web browser using the commands below:

```bash
# Run build
pixi run -e docs postinstall
pixi run docs

# Open documentation
open docs/_build/html/index.html
```
49 changes: 0 additions & 49 deletions docs/sites/development.rst

This file was deleted.

30 changes: 30 additions & 0 deletions docs/sites/faq.md
@@ -0,0 +1,30 @@
# FAQ

Whenever you come across something that surprised you or required some non-trivial
thinking, please add it here.

## How do I define additional unique keys in a `dy.Schema`?

By default, `dataframely` only supports defining a single non-nullable (composite) primary key in `dy.Schema`.
However, in some scenarios it may be useful to define additional unique keys, which may include nullable fields and
which are enforced on top of the primary key.

Consider the following example, which demonstrates two rules: one for validating that a field is entirely unique, and
another for validating that a field, when provided, is unique.

```python
class UserSchema(dy.Schema):
user_id = dy.UInt64(primary_key=True, nullable=False)
username = dy.String(nullable=False)
email = dy.String(nullable=True) # Must be unique, or null.

@dy.rule(group_by=["username"])
def unique_username() -> pl.Expr:
"""Username, a non-nullable field, must be total unique."""
return pl.len() == 1

@dy.rule()
def unique_email_or_null() -> pl.Expr:
"""Email must be unique, if provided."""
return pl.col("email").is_null() | pl.col("email").is_unique()
```
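
To see what these two rule expressions check, here is a plain-polars sketch of their logic on a tiny, made-up data frame (this intentionally bypasses dataframely's validation machinery and only illustrates the expressions):

```python
import polars as pl

df = pl.DataFrame(
    {
        "user_id": [1, 2, 3],
        "username": ["alice", "alice", "bob"],
        "email": [None, None, "bob@example.com"],
    }
)

# unique_username: grouped by username, pl.len() == 1 must hold for each group.
# The "alice" group has two rows, so the expression is False for that group.
username_check = df.group_by("username").agg(ok=(pl.len() == 1))

# unique_email_or_null: null emails pass; non-null emails must not repeat.
# Both null emails pass via is_null(), and "bob@example.com" is unique.
email_check = df.select(ok=pl.col("email").is_null() | pl.col("email").is_unique())
```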
30 changes: 0 additions & 30 deletions docs/sites/faq.rst

This file was deleted.

5 changes: 5 additions & 0 deletions docs/sites/features/index.md
@@ -0,0 +1,5 @@
# Features

```{toctree}
primary-keys
```
7 changes: 0 additions & 7 deletions docs/sites/features/index.rst

This file was deleted.
