JhossePaul/pysetl

PySetl - A PySpark ETL Framework

Overview

PySetl is a framework that improves the readability and structure of PySpark ETL projects. It builds on Python's typing syntax to reduce runtime errors, both through linting tools and by verifying types at runtime, which makes large ETL pipelines more stable.

To accomplish this, it provides three modules:

  • pysetl.config: Type-safe configuration.
  • pysetl.storage: Agnostic and extensible data source connections.
  • pysetl.workflow: Pipeline management and dependency injection.

PySetl is designed with Python typing syntax at its core. Hence, we strongly suggest typedspark and pydantic for development.
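To illustrate what type-safe configuration means in practice, here is a minimal sketch that uses only the standard library. It is not PySetl's API: `CsvConfig` and its fields are hypothetical names chosen for the example, and the check only handles plain classes such as `str` and `bool`. In a real project, pydantic would perform this kind of validation far more thoroughly.

```python
from dataclasses import dataclass, fields
from typing import get_type_hints


@dataclass(frozen=True)
class CsvConfig:
    """Hypothetical configuration for a CSV source (not part of PySetl)."""

    path: str
    header: bool = True
    delimiter: str = ","

    def __post_init__(self) -> None:
        # Resolve the annotations and check every field at construction time,
        # so a wrong value fails here instead of deep inside a Spark job.
        hints = get_type_hints(type(self))
        for field in fields(self):
            value = getattr(self, field.name)
            expected = hints[field.name]
            if not isinstance(value, expected):
                raise TypeError(
                    f"{field.name} must be {expected.__name__}, "
                    f"got {type(value).__name__}"
                )


if __name__ == "__main__":
    config = CsvConfig(path="data/input.csv")
    print(config)

    try:
        CsvConfig(path=42)  # wrong type: rejected before any data is read
    except TypeError as error:
        print(f"Rejected: {error}")
```

A linter or type checker would flag `CsvConfig(path=42)` before the program runs; the runtime check covers values that only arrive at execution time, such as those read from a file or environment variable.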

Why use PySetl?

  • Model complex data pipelines.
  • Reduce risk in production through type-safe development.
  • Improve large project structure and readability.

Installation

PySetl is available on PyPI:

pip install pysetl

PySetl doesn't list pyspark as a dependency, since most environments already provide their own Spark installation. If yours does not, you can install pyspark along with PySetl by running:

pip install "pysetl[pyspark]"
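If you are unsure which of the two commands applies, one way to find out is to check whether pyspark is already importable in the current environment. The helper below is a small sketch using only the standard library; `pyspark_available` is a hypothetical name and not something PySetl provides.

```python
from importlib.util import find_spec


def pyspark_available() -> bool:
    """Return True if a `pyspark` package can be found in this environment."""
    # find_spec returns None when no top-level module with that name exists.
    return find_spec("pyspark") is not None


if __name__ == "__main__":
    if pyspark_available():
        print("pyspark found: `pip install pysetl` is enough.")
    else:
        print('pyspark not found: consider `pip install "pysetl[pyspark]"`.')
```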

Acknowledgments

PySetl is a port of SETL. We want to fully acknowledge that this package is heavily inspired by the work of the SETL team; we have adapted it to work in Python.