PySetl is a framework that improves the readability and structure of PySpark ETL projects. It is designed to take advantage of Python's typing syntax, reducing runtime errors through linting tools and runtime type verification, and thereby improving the stability of large ETL pipelines.
To accomplish this, PySetl provides a set of tools:
pysetl.config
: Type-safe configuration.

pysetl.storage
: Agnostic and extensible data source connections.

pysetl.workflow
: Pipeline management and dependency injection.
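The division of labor above can be sketched with a hypothetical, stdlib-only example; none of these class or function names come from pysetl itself. It illustrates the two core ideas: a configuration object whose fields are verified against their type annotations at runtime, and a small pipeline where each step's output feeds the next.

```python
from dataclasses import dataclass, fields
from typing import get_type_hints

@dataclass(frozen=True)
class JobConfig:
    """Illustrative type-safe configuration (not pysetl's API)."""
    input_path: str
    output_path: str
    partitions: int

    def __post_init__(self) -> None:
        # Verify each field's value against its annotated type at runtime.
        hints = get_type_hints(type(self))
        for f in fields(self):
            value = getattr(self, f.name)
            expected = hints[f.name]
            if not isinstance(value, expected):
                raise TypeError(
                    f"{f.name} must be {expected.__name__}, "
                    f"got {type(value).__name__}"
                )

def extract(config: JobConfig) -> list[int]:
    # Stand-in for reading rows from config.input_path.
    return [1, 2, 3]

def transform(rows: list[int]) -> list[int]:
    # Stand-in for a typed transformation step.
    return [row * 10 for row in rows]

config = JobConfig(input_path="in.csv", output_path="out.csv", partitions=4)
result = transform(extract(config))
print(result)  # [10, 20, 30]
```

Passing `partitions="4"` here would raise a `TypeError` at construction time rather than failing deep inside a Spark job, which is the kind of early failure PySetl's type-safe configuration aims for.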
PySetl is designed with Python typing syntax at its core. Hence, we strongly recommend using typedspark and pydantic alongside it.

With PySetl you can:
- Model complex data pipelines.
- Reduce risks at production with type-safe development.
- Improve large project structure and readability.
PySetl is available on PyPI:
```shell
pip install pysetl
```
PySetl doesn't list pyspark as a dependency, since most clusters ship with their own Spark distribution. Nevertheless, you can install pyspark as an optional extra:
```shell
pip install "pysetl[pyspark]"
```
PySetl is a port of SETL. We fully acknowledge that this package is heavily inspired by the work of the SETL team; we simply adapted their design to work in Python.