UNDER ACTIVE DEVELOPMENT
NOTE: Tsumugi Shiraui is a chimera: a hybrid of Human and Gauna. She combines the chaotic power of Gauna with a Human intillegence and empathia. Like an original character of the Manga "Knights of Sidonia", this project aims to make a hybrid of very powerful but hard to learn and use Deequ Scala Library with a usability and simplicity of Spark Connect (PySpark Connect, Spark Connect Go, Spark Connect Rust, etc.).
The project's goal is to create a modern, SparkConnect-native wrapper for the elegant and high-performance Data Quality library, AWS Deequ. The PySpark Connect API is expected to be the primary focus, but the PySpark Classic API will also be maintained. Additionally, other thin clients such as connect-go
or connect-rs
will be supported.
While Amazon Deequ itself is well-maintained, the existing PyDeequ wrapper faces two main challenges:
- It relies on direct
py4j
calls to the underlying Deequ Scala library. This approach presents two issues: a) It cannot be used in any SparkConnect environment. b)py4j
is not well-suited for working with Scala code. For example, try creating anOption[Long]
from Python usingpy4j
to understand the complexity involved. - It suffers from a lack of maintenance, likely due to the issues mentioned in point 1. This can be seen in this GitHub issue.
- The current python-deequ implementation makes it impossible to call row-level results because
py4j.reflection.PythonProxyHandler
is not serializable. This problem is documented in this GitHub issue.
- Maintain
proto3
definitions of basic Deequ Scala structures (such asCheck
,Analyzer
,AnomalyDetectionStrategy
,VerificationSuite
,Constraint
, etc.); - Maintain a Scala SparkConnect Plugin that enables working with Deequ from any client;
- Maintain a Python client that provides a user-friendly API on top of classes generated by
protoc
; - Provide utils to enhance Deequ's server-side functionality by adding more syntactic sugar, while ensuring their maintenance remains on the client-side.
- Creating a replacement for Deequ is not the goal of this project. Similarly, forking the entire project is not intended. Deequ's core is well-maintained, and there are no compelling reasons to create an aggressive fork of it.
- Developing a low-code or zero-code Data Quality tool with YAML file configuration is not the project's objective. Currently, the focus is on providing a well-maintained and documented client API that can be used to create low-code tooling.
From a high-level perspective, Tsumugi implements three main components:
- Messages for Deequ Core's main data structures
- SparkConnect Plugin and utilities
- PySpark Connect and PySpark Classic thin client
The diagram below provides an overview of this concept:
The tsumugi-server/src/main/protobuf/
directory contains messages that define the main structures of the Deequ Scala library:
VerificationSuite
: This is the top-level Deequ object. For more details, refer tosuite.proto
.Analyzer
: This object is defined usingoneof
from a list of analyzers (includingCountDistinct
,Size
,Compliance
, etc.). For implementation details, seeanalyzers.proto
.AnomalyDetection
and its associated strategies. For more information, consultstrategies.proto
.Check
: This is defined usingConstraint
,CheckLevel
, and a description.Constraint
: This is defined as an Analyzer (which computes a metric), a reference value, and a comparison sign.
The file tsumugi-server/src/main/scala/org/apache/spark/sql/tsumugi/DeequConnectPlugin.scala
contains the plugin code itself. It is designed to be very simple, consisting of approximately 50 lines of code. The plugin's functionality is straightforward: it checks if the message is a VerificationSuite
, passes it into DeequSuiteBuilder
, and then packages the result back into a Relation
.
The file tsumugi-server/src/main/scala/io/mrpowers/tsumugi/DeequSuiteBuilder.scala
contains code that creates Deequ objects from protobuf messages. It maps enums and constants to their corresponding Deequ counterparts, and generates com.amazon.deequ
objects from the respective protobuf messages. The code ultimately returns a ready-to-use Deequ top-level structure.
At the moment there are no package distributions of the server part as well there is no pre-built PyPi packages for clients. The only way to play with the project at the moment is to build it from the source code.
There is a simple Python script that performs the following tasks:
- Builds the server plugin;
- Downloads the required Spark version and all missing JAR files;
- Combines everything together;
- Runs the local Spark Connect Server with the Tsumugi plugin.
python dev/run-connect.py
Building the server component requires Maven and Java 11. You can find installation instructions for both in their official documentation: Maven and Java 11. This script also requires Python 3.10 or higher. After installation, you can connect to the server and test it by creating a Python virtual environment. This process requires the poetry
build tool. You can find instructions on how to install Poetry on their official website.
cd tsumugi-python
poetry env use python3.10 # any version bigger than 3.10 should work
poetry install --with dev # that install tsumugi as well as jupyter notebooks and pyspark[connect]
Now you can run jupyter and try the example notebook (tsumugi-python/examples/basic_example.ipynb
): Notebook
Building the server part requires Maven.
cd tsumugi-server
mvn clean package
Installing the PySpark client requires poetry
.
cd tsumugi-python
poetry env use python3.10 # 3.10+
poetry install
Tsumugi is built on top of Deequ Data Quality tool:
- Schelter, Sebastian, et al. "Automating large-scale data quality verification." Proceedings of the VLDB Endowment 11.12 (2018): 1781-1794., link
- Schelter, Sebastian, et al. "Unit testing data with deequ." Proceedings of the 2019 International Conference on Management of Data. 2019., link
- Schelter, Sebastian, et al. "Deequ-data quality validation for machine learning pipelines." (2018)., link