Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
214 changes: 214 additions & 0 deletions .github/copilot-instructions.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,214 @@
# Dataframely - Coding Agent Instructions

## Project Overview

Dataframely is a declarative, polars-native data frame validation library. It validates schemas and data content in
polars DataFrames using native polars expressions and a custom Rust-based polars plugin for high performance. It
supports validating individual data frames via `Schema` classes and interconnected data frames via `Collection` classes.

## Tech Stack

### Core Technologies

- **Python**: Primary language for the public API
- **Rust**: Backend for polars plugin and custom regex operations
- **Polars**: Only supported data frame library
- **pyo3 & maturin**: Rust-Python bindings and build system
- **pixi**: Primary environment and task manager (NOT pip/conda directly)

### Build System

- **maturin**: Builds the Rust extension module `dataframely._native`
- **Cargo**: Rust dependency management
- Rust toolchain specified in `rust-toolchain.toml` with clippy and rustfmt components

## Environment Setup

**CRITICAL**: Always use `pixi` commands - never run `pip`, `conda`, `python`, or `cargo` directly unless specifically
required for Rust-only operations.

### Initial Setup

Unless already performed via external setup steps:

```bash
# Install Rust toolchain
rustup show

# Install pixi environment and dependencies
pixi install

# Build and install the package locally (REQUIRED after Rust changes)
pixi run postinstall
```

### After Rust Code Changes

**Always run** `pixi run postinstall` after modifying any Rust code in `src/` to rebuild the native extension.

## Development Workflow

### Running Tests

```bash
# Run all tests (excludes S3 tests by default)
pixi run test

# Run tests with S3 backend (requires moto server)
pixi run test -m s3

# Run specific test file or directory
pixi run test tests/schema/

# Run with coverage
pixi run test-coverage

# Run benchmarks
pixi run test-bench
```

### Code Quality

**NEVER** run linters/formatters directly. Use pre-commit:

```bash
# Run all pre-commit hooks
pixi run pre-commit run
```

Pre-commit handles:

- **Python**: ruff (lint & format), mypy (type checking), docformatter
- **Rust**: cargo fmt, cargo clippy
- **Other**: prettier (md/yml), taplo (toml), license headers, trailing whitespace

### Building Documentation

```bash
# Build documentation
pixi run -e docs postinstall
pixi run docs

# Open in browser (macOS)
open docs/_build/html/index.html
```

## Project Structure

```
dataframely/ # Python package
schema.py # Core Schema class for DataFrame validation
collection/ # Collection class for validating multiple interconnected DataFrames
columns/ # Column type definitions (String, Integer, Float, etc.)
testing/ # Testing utilities (factories, masks, storage mocks)
_storage/ # Storage backends (Parquet, Delta Lake)
_rule.py # Rule decorator for validation rules
_plugin.py # Polars plugin registration
_native.pyi # Type stubs for Rust extension

src/ # Rust source code
lib.rs # PyO3 module definition
polars_plugin/ # Custom polars plugin for validation
regex/ # Custom regex operations

tests/ # Unit tests (mirrors dataframely/ structure)
benches/ # Benchmark tests
conftest.py # Shared pytest fixtures (including s3_server)

docs/ # Sphinx documentation
guides/ # User guides and examples
api/ # Auto-generated API reference
```

## Pixi Environments

Multiple environments for different purposes:

- **default**: Base Python + core dependencies
- **dev**: Includes jupyter for notebooks
- **test**: Testing dependencies (pytest, moto, boto3, etc.)
- **docs**: Documentation building (sphinx, myst-parser, etc.)
- **lint**: Linting and formatting tools
- **optionals**: Optional dependencies (pydantic, deltalake, pyarrow, sqlalchemy)
- **py310-py314**: Python version-specific environments

Use `-e <env>` to run commands in specific environments:

```bash
pixi run -e test test
pixi run -e docs docs
```

## API Design Principles

### Critical Guidelines

1. **NO BREAKING CHANGES**: Public API must remain backward compatible
2. **100% Test Coverage**: All new code requires tests
3. **Documentation Required**: All public features need docstrings + API docs
4. **Cautious API Extension**: Avoid adding to public API unless necessary

### Public API

Public exports are in `dataframely/__init__.py`. Main components:

- **Schema classes**: `Schema` for DataFrame validation
- **Collection classes**: `Collection`, `CollectionMember` for multi-DataFrame validation
- **Column types**: `String`, `Integer`, `Float`, `Bool`, `Date`, `Datetime`, etc.
- **Decorators**: `@rule()`, `@filter()`
- **Type hints**: `DataFrame[Schema]`, `LazyFrame[Schema]`, `Validation`

## Common Pitfalls & Solutions

### S3 Testing

The `s3_server` fixture in `tests/conftest.py` uses `subprocess.Popen` to start moto_server on port 9999. This is a **workaround** for a polars issue with ThreadedMotoServer. When the polars issue is fixed, it should be replaced with ThreadedMotoServer (code is commented in the file).

**Note**: CI skips S3 tests by default. Run with `pixi run test -m s3` when modifying storage backends.

## Testing Strategy

- Tests are organized by module, mirroring the `dataframely/` structure
- Use `dy.Schema.sample()` for generating test data
- Test both eager (`DataFrame`) and lazy (`LazyFrame`) execution
- S3 tests use moto server fixture from `conftest.py`
- Benchmark tests in `tests/benches/` use pytest-benchmark

## Validation Pattern

Typical usage pattern:

```python
class MySchema(dy.Schema):
col = dy.String(nullable=False)

@dy.rule()
def my_rule(cls) -> pl.Expr:
return pl.col("col").str.len_chars() > 0

# Validate and cast
validated_df: dy.DataFrame[MySchema] = MySchema.validate(df, cast=True)
```

## Key Configuration Files

- `pixi.toml`: Environment and task definitions
- `pyproject.toml`: Python package metadata, tool configurations (ruff, mypy, pytest)
- `Cargo.toml`: Rust dependencies and build settings
- `.pre-commit-config.yaml`: All code quality checks
- `rust-toolchain.toml`: Rust nightly version specification

## When Making Changes

1. **Python code**: Run `pixi run pre-commit run` before committing
2. **Rust code**: Run `pixi run postinstall` to rebuild, then run tests
3. **Tests**: Ensure `pixi run test` passes
4. **Documentation**: Update docstrings
5. **API changes**: Ensure backward compatibility or document migration path

## Performance Considerations

- Validation uses native polars expressions for performance
- Custom Rust plugin for advanced validation logic
- Lazy evaluation supported via `LazyFrame` for large datasets
- Avoid materializing data unnecessarily in validation rules
26 changes: 26 additions & 0 deletions .github/workflows/copilot-setup-steps.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,26 @@
name: Copilot Setup Steps
on:
pull_request:
paths:
- .github/workflows/copilot-setup-steps.yml
workflow_dispatch:

jobs:
copilot-setup-steps:
runs-on: ubuntu-latest
permissions:
contents: read
id-token: write
steps:
- name: Checkout branch
uses: actions/checkout@08c6903cd8c0fde910a37f88322edcfb5dd907a8 # v5.0.0
- name: Set up pixi
uses: prefix-dev/setup-pixi@28eb668aafebd9dede9d97c4ba1cd9989a4d0004 # v0.9.2
with:
environments: default
- name: Install Rust
run: rustup show
- name: Cache Rust dependencies
uses: Swatinem/rust-cache@f13886b937689c021905a6b90929199931d60db1 # v2.8.1
- name: Install repository
run: pixi run postinstall
Loading