Quantco · borchero · Nov 4, 2025 · Nov 4, 2025
@@ -0,0 +1,214 @@
+# Dataframely - Coding Agent Instructions
+
+## Project Overview
+
+Dataframely is a declarative, polars-native data frame validation library. It validates schemas and data content in
+polars DataFrames using native polars expressions and a custom Rust-based polars plugin for high performance. It
+supports validating individual data frames via `Schema` classes and interconnected data frames via `Collection` classes.
+
+## Tech Stack
+
+### Core Technologies
+
+- **Python**: Primary language for the public API
+- **Rust**: Backend for polars plugin and custom regex operations
+- **Polars**: Only supported data frame library
+- **pyo3 & maturin**: Rust-Python bindings and build system
+- **pixi**: Primary environment and task manager (NOT pip/conda directly)
+
+### Build System
+
+- **maturin**: Builds the Rust extension module `dataframely._native`
+- **Cargo**: Rust dependency management
+- Rust toolchain specified in `rust-toolchain.toml` with clippy and rustfmt components
+
+## Environment Setup
+
+**CRITICAL**: Always use `pixi` commands - never run `pip`, `conda`, `python`, or `cargo` directly unless specifically
+required for Rust-only operations.
+
+### Initial Setup
+
+Unless already performed via external setup steps:
+
+```bash
+# Install Rust toolchain
+rustup show
+
+# Install pixi environment and dependencies
+pixi install
+
+# Build and install the package locally (REQUIRED after Rust changes)
+pixi run postinstall
+```
+
+### After Rust Code Changes
+
+**Always run** `pixi run postinstall` after modifying any Rust code in `src/` to rebuild the native extension.
+
+## Development Workflow
+
+### Running Tests
+
+```bash
+# Run all tests (excludes S3 tests by default)
+pixi run test
+
+# Run tests with S3 backend (requires moto server)
+pixi run test -m s3
+
+# Run specific test file or directory
+pixi run test tests/schema/
+
+# Run with coverage
+pixi run test-coverage
+
+# Run benchmarks
+pixi run test-bench
+```
+
+### Code Quality
+
+**NEVER** run linters/formatters directly. Use pre-commit:
+
+```bash
+# Run all pre-commit hooks on staged files (append --all-files to check the whole repository)
+pixi run pre-commit run
+```
+
+Pre-commit handles:
+
+- **Python**: ruff (lint & format), mypy (type checking), docformatter
+- **Rust**: cargo fmt, cargo clippy
+- **Other**: prettier (md/yml), taplo (toml), license headers, trailing whitespace
+
+### Building Documentation
+
+```bash
+# Build documentation
+pixi run -e docs postinstall
+pixi run docs
+
+# Open in browser (macOS)
+open docs/_build/html/index.html
+```
+
+## Project Structure
+
+```
+dataframely/              # Python package
+  schema.py              # Core Schema class for DataFrame validation
+  collection/            # Collection class for validating multiple interconnected DataFrames
+  columns/               # Column type definitions (String, Integer, Float, etc.)
+  testing/               # Testing utilities (factories, masks, storage mocks)
+  _storage/              # Storage backends (Parquet, Delta Lake)
+  _rule.py               # Rule decorator for validation rules
+  _plugin.py             # Polars plugin registration
+  _native.pyi            # Type stubs for Rust extension
+
+src/                     # Rust source code
+  lib.rs                 # PyO3 module definition
+  polars_plugin/         # Custom polars plugin for validation
+  regex/                 # Custom regex operations
+
+tests/                   # Unit tests (mirrors dataframely/ structure)
+  benches/               # Benchmark tests
+  conftest.py            # Shared pytest fixtures (including s3_server)
+
+docs/                    # Sphinx documentation
+  guides/                # User guides and examples
+  api/                   # Auto-generated API reference
+```
+
+## Pixi Environments
+
+Multiple environments for different purposes:
+
+- **default**: Base Python + core dependencies
+- **dev**: Includes jupyter for notebooks
+- **test**: Testing dependencies (pytest, moto, boto3, etc.)
+- **docs**: Documentation building (sphinx, myst-parser, etc.)
+- **lint**: Linting and formatting tools
+- **optionals**: Optional dependencies (pydantic, deltalake, pyarrow, sqlalchemy)
+- **py310-py314**: Python version-specific environments
+
+Use `-e <env>` to run commands in specific environments:
+
+```bash
+pixi run -e test test
+pixi run -e docs docs
+```
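As a rough illustration of the `-e <env>` convention above, the following Python sketch assembles a `pixi run` invocation as an argument list. The helper name `pixi_run_command` is hypothetical and not part of the repository; it only builds the command and never executes pixi.

```python
from collections.abc import Sequence
from typing import Optional


def pixi_run_command(
    task: str, env: Optional[str] = None, extra_args: Sequence[str] = ()
) -> list[str]:
    """Build the argument list for `pixi run [-e <env>] <task> [args...]`.

    Hypothetical helper for illustration only; it does not invoke pixi.
    """
    command = ["pixi", "run"]
    if env is not None:
        # Select a named environment, e.g. "test" or "docs"
        command += ["-e", env]
    command.append(task)
    # Anything after the task name is appended as extra arguments
    command.extend(extra_args)
    return command


print(pixi_run_command("test", env="test"))
print(pixi_run_command("test", extra_args=["-m", "s3"]))
```

Returning a list rather than a single string means the result can be handed to `subprocess.run` without shell quoting concerns.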
+
+## API Design Principles
+
+### Critical Guidelines
+
+1. **NO BREAKING CHANGES**: Public API must remain backward compatible
+2. **100% Test Coverage**: All new code requires tests
+3. **Documentation Required**: All public features need docstrings + API docs
+4. **Cautious API Extension**: Avoid adding to public API unless necessary
+
+### Public API
+
+Public exports are in `dataframely/__init__.py`. Main components:
+
+- **Schema classes**: `Schema` for DataFrame validation
+- **Collection classes**: `Collection`, `CollectionMember` for multi-DataFrame validation
+- **Column types**: `String`, `Integer`, `Float`, `Bool`, `Date`, `Datetime`, etc.
+- **Decorators**: `@rule()`, `@filter()`
+- **Type hints**: `DataFrame[Schema]`, `LazyFrame[Schema]`, `Validation`
+
+## Common Pitfalls & Solutions
+
+### S3 Testing
+
+The `s3_server` fixture in `tests/conftest.py` uses `subprocess.Popen` to start `moto_server` on port 9999. This is a
+**workaround** for a polars issue with `ThreadedMotoServer`. Once that issue is fixed, the fixture should switch to
+`ThreadedMotoServer` (the replacement code is already in the file, commented out).
+
+**Note**: CI skips S3 tests by default. Run with `pixi run test -m s3` when modifying storage backends.
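The subprocess-based approach described above could look roughly like the sketch below. This is not the repository's actual fixture: `wait_for_port` and `background_server` are hypothetical names, and the command to launch is supplied by the caller. The sketch only shows the general pattern of starting a server process, waiting until its port accepts connections, and shutting it down afterwards.

```python
import socket
import subprocess
import time
from collections.abc import Iterator
from contextlib import contextmanager


def wait_for_port(host: str, port: int, timeout: float = 10.0) -> bool:
    """Poll until a TCP connection to (host, port) succeeds or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=0.5):
                return True
        except OSError:
            # Server not accepting connections yet; retry shortly
            time.sleep(0.1)
    return False


@contextmanager
def background_server(command: list[str], host: str, port: int) -> Iterator[str]:
    """Run `command` as a child process and yield its base URL once the port is open."""
    process = subprocess.Popen(command)
    try:
        if not wait_for_port(host, port):
            raise RuntimeError(f"server did not start listening on {host}:{port}")
        yield f"http://{host}:{port}"
    finally:
        # Always stop the child process, even if the body raised
        process.terminate()
        try:
            process.wait(timeout=5)
        except subprocess.TimeoutExpired:
            process.kill()
```

In a pytest setup, a context manager like this would typically be wrapped in a session-scoped fixture so that the server starts once per test run.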
+
+## Testing Strategy
+
+- Tests are organized by module, mirroring the `dataframely/` structure
+- Use the `sample()` classmethod of a schema (e.g. `MySchema.sample()`) for generating test data
+- Test both eager (`DataFrame`) and lazy (`LazyFrame`) execution
+- S3 tests use moto server fixture from `conftest.py`
+- Benchmark tests in `tests/benches/` use pytest-benchmark
+
+## Validation Pattern
+
+Typical usage pattern:
+
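+```python
+import dataframely as dy
+import polars as pl
+
+
+class MySchema(dy.Schema):
+    col = dy.String(nullable=False)
+
+    @dy.rule()
+    def my_rule(cls) -> pl.Expr:
+        # Rule expressions must evaluate to True for valid rows
+        return pl.col("col").str.len_chars() > 0
+
+
+# Example input; any polars DataFrame with a matching `col` column works
+df = pl.DataFrame({"col": ["a", "b"]})
+
+# Validate and cast; raises if validation fails
+validated_df: dy.DataFrame[MySchema] = MySchema.validate(df, cast=True)
+```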
+
+## Key Configuration Files
+
+- `pixi.toml`: Environment and task definitions
+- `pyproject.toml`: Python package metadata, tool configurations (ruff, mypy, pytest)
+- `Cargo.toml`: Rust dependencies and build settings
+- `.pre-commit-config.yaml`: All code quality checks
+- `rust-toolchain.toml`: Rust nightly version specification
+
+## When Making Changes
+
+1. **Python code**: Run `pixi run pre-commit run` before committing
+2. **Rust code**: Run `pixi run postinstall` to rebuild, then run tests
+3. **Tests**: Ensure `pixi run test` passes
+4. **Documentation**: Update docstrings and, for public features, the API docs
+5. **API changes**: Ensure backward compatibility or document migration path
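The checklist above can be sketched as a small decision helper. This is a hypothetical illustration, not tooling that exists in the repository: the function name `commands_for_changes` is an assumption, and the path rules are inferred from the project layout described earlier (`src/` for Rust code, `dataframely/_storage/` for storage backends).

```python
from collections.abc import Iterable
from pathlib import PurePosixPath


def commands_for_changes(changed_paths: Iterable[str]) -> list[str]:
    """Suggest pixi commands for a set of changed files (repo-relative POSIX paths).

    Hypothetical helper; the mapping below is an assumption based on the guide.
    """
    paths = [PurePosixPath(p) for p in changed_paths]
    commands: list[str] = []

    # Rust sources live under src/; the native extension must be rebuilt first
    if any(p.parts[:1] == ("src",) or p.suffix == ".rs" for p in paths):
        commands.append("pixi run postinstall")

    # Linting and the default test suite apply to every change
    commands.append("pixi run pre-commit run")
    commands.append("pixi run test")

    # S3 tests are skipped by default, so request them for storage backend changes
    if any(p.parts[:2] == ("dataframely", "_storage") for p in paths):
        commands.append("pixi run test -m s3")

    return commands


for command in commands_for_changes(["src/lib.rs", "dataframely/schema.py"]):
    print(command)
```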
+
+## Performance Considerations
+
+- Validation uses native polars expressions for performance
+- Custom Rust plugin for advanced validation logic
+- Lazy evaluation supported via `LazyFrame` for large datasets
+- Avoid materializing data unnecessarily in validation rules
@@ -0,0 +1,26 @@
+name: Copilot Setup Steps
+on:
+  pull_request:
+    paths:
+      - .github/workflows/copilot-setup-steps.yml
+  workflow_dispatch:
+
+jobs:
+  copilot-setup-steps:
+    runs-on: ubuntu-latest
+    permissions:
+      contents: read
+      id-token: write
+    steps:
+      - name: Checkout branch
+        uses: actions/checkout@08c6903cd8c0fde910a37f88322edcfb5dd907a8 # v5.0.0
+      - name: Set up pixi
+        uses: prefix-dev/setup-pixi@28eb668aafebd9dede9d97c4ba1cd9989a4d0004 # v0.9.2
+        with:
+          environments: default
+      - name: Install Rust
+        run: rustup show
+      - name: Cache Rust dependencies
+        uses: Swatinem/rust-cache@f13886b937689c021905a6b90929199931d60db1 # v2.8.1
+      - name: Install repository
+        run: pixi run postinstall