**Deequ** is an open-source data quality library built on top of Apache Spark. It lets you define "**unit tests for data**" and scales to large datasets.

With Deequ, you can compute data quality metrics and define data quality rules (constraints) easily. This notebook uses PyDeequ, its Python interface.

[**Current supported functionalities**](https://github.com/awslabs/python-deequ/blob/master/docs/checks.md)

[**Documentation**](https://pydeequ.readthedocs.io/_/downloads/en/latest/pdf/)

In [2]:
!apt-get install openjdk-11-jdk-headless -qq > /dev/null
!wget -q https://archive.apache.org/dist/spark/spark-3.1.1/spark-3.1.1-bin-hadoop2.7.tgz
!tar xf spark-3.1.1-bin-hadoop2.7.tgz
!pip install -q findspark

In [3]:
!pwd

/content


In [4]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.1.1-bin-hadoop2.7"

In [5]:
import findspark
findspark.init()

In [6]:
!pip install pydeequ

Collecting pydeequ
  Downloading pydeequ-1.4.0-py3-none-any.whl (37 kB)
Installing collected packages: pydeequ
Successfully installed pydeequ-1.4.0


In [7]:
!pip install pyspark==3.1.1

Collecting pyspark==3.1.1
  Downloading pyspark-3.1.1.tar.gz (212.3 MB)
  Preparing metadata (setup.py) ... done
Collecting py4j==0.10.9 (from pyspark==3.1.1)
  Downloading py4j-0.10.9-py2.py3-none-any.whl (198 kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.1.1-py2.py3-none-any.whl size=212767581 sha256=3b61f490bc9a30cc834e391a8ac1b018af5fe13c10268317a8d9d39089f346ef
  Stored in directory: /root/.cache/pip/wheels/a0/3f/72/8efd988f9ae041f051c75e6834cd92dd6d13a726e206e8b6f3
Successfully built pyspark
Installing collected packages: py4j, pyspark
  Attempting uninstall: py4j
    Found existing installation: py4j 0

In [8]:
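import os
# PyDeequ reads SPARK_VERSION to choose the matching Deequ Maven coordinates,
# so set it before importing pydeequ.
os.environ['SPARK_VERSION'] = '3.1.1'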

In [9]:
from pyspark.sql import SparkSession, Row
import pydeequ
spark = (SparkSession
             .builder
             .config("spark.jars.packages", pydeequ.deequ_maven_coord)
             .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
             .getOrCreate())

In [10]:
df = spark.sparkContext.parallelize([
    Row(a="foo", b=1, c=5, d=10, e=None, f=0),
    Row(a="bar", b=2, c=6, d=4, e= 12, f=90),
    Row(a="baz", b=3, c=None, d=None, e = 20, f= -10),
    Row(a="cab", b=3, c=8,  d=None, e =None, f=50)]).toDF()

In [11]:
df.show()

+---+---+----+----+----+---+
|  a|  b|   c|   d|   e|  f|
+---+---+----+----+----+---+
|foo|  1|   5|  10|null|  0|
|bar|  2|   6|   4|  12| 90|
|baz|  3|null|null|  20|-10|
|cab|  3|   8|null|null| 50|
+---+---+----+----+----+---+



In [12]:
from pydeequ.checks import *
from pydeequ.verification import *
check = Check(spark, CheckLevel.Warning, "Basic Check")

# **Size Validation**

**hasSize**(assertion, hint=None)--

Creates a constraint that calculates the size of the DataFrame (its number of rows) and runs the assertion (a lambda) on it.
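As a rough illustration of the semantics (a plain-Python sketch, not PyDeequ itself), the assertion is simply a callable applied to the row count. The helper name `has_size` is hypothetical; the rows mirror the sample DataFrame used in this notebook.

```python
# Plain-Python sketch of the hasSize semantics; no Spark required.
# `has_size` is a hypothetical helper, not part of the PyDeequ API.
rows = [
    ("foo", 1, 5, 10, None, 0),
    ("bar", 2, 6, 4, 12, 90),
    ("baz", 3, None, None, 20, -10),
    ("cab", 3, 8, None, None, 50),
]

def has_size(rows, assertion):
    """Apply the assertion to the number of rows and return the outcome."""
    return bool(assertion(len(rows)))

print(has_size(rows, lambda x: x == 4))  # True: there are 4 rows
print(has_size(rows, lambda x: x <= 2))  # False: 4 is not <= 2
```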

In [13]:
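# Two constraints on the row count: the first should pass (df has 4 rows), the second should fail.
checkResult = VerificationSuite(spark) \
 .onData(df) \
 .addCheck(check.hasSize(lambda x: x == 4).hasSize(lambda x: x <= 2)) \
 .run()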

Python Callback server started!


In [14]:
checkResult_df = VerificationResult.checkResultsAsDataFrame(spark, checkResult)
checkResult_df.show(truncate=False)

+-----------+-----------+------------+--------------------------+-----------------+--------------------------------------------------+
|check      |check_level|check_status|constraint                |constraint_status|constraint_message                                |
+-----------+-----------+------------+--------------------------+-----------------+--------------------------------------------------+
+-----------+-----------+------------+--------------------------+-----------------+--------------------------------------------------+



In [15]:
from pydeequ.checks import *
from pydeequ.verification import *
check = Check(spark, CheckLevel.Warning, "Basic Check")
checkResult = VerificationSuite(spark) \
 .onData(df) \
 .addCheck(check.hasSize(lambda x: x == 5))\
 .run()
checkResult_df = VerificationResult.checkResultsAsDataFrame(spark, checkResult)
checkResult_df.show(truncate=False)

+-----------+-----------+------------+--------------------------+-----------------+--------------------------------------------------+
|check      |check_level|check_status|constraint                |constraint_status|constraint_message                                |
+-----------+-----------+------------+--------------------------+-----------------+--------------------------------------------------+
+-----------+-----------+------------+--------------------------+-----------------+--------------------------------------------------+



# **Completeness Validation**

**isComplete**(column, hint=None)--

Creates a constraint that asserts that a column is complete, i.e. contains no null values.
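To make the metric behind this check concrete, here is a minimal plain-Python sketch that treats completeness as the fraction of non-null values in a column. The helpers `completeness` and `is_complete` are hypothetical names for illustration only; the values are copied from columns `a`, `c` and `d` of the sample DataFrame.

```python
# Plain-Python sketch of the completeness metric; no Spark required.
# `completeness` and `is_complete` are hypothetical helpers, not PyDeequ functions.
columns = {
    "a": ["foo", "bar", "baz", "cab"],
    "c": [5, 6, None, 8],
    "d": [10, 4, None, None],
}

def completeness(values):
    """Fraction of values that are not null (None)."""
    return sum(v is not None for v in values) / len(values)

def is_complete(values):
    """A column counts as complete only when every value is non-null."""
    return completeness(values) == 1.0

for name, values in columns.items():
    print(name, completeness(values), is_complete(values))
```

Under this reading, only column `a` is complete; `c` has completeness 0.75 and `d` has 0.5.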

In [16]:
df.show()

+---+---+----+----+----+---+
|  a|  b|   c|   d|   e|  f|
+---+---+----+----+----+---+
|foo|  1|   5|  10|null|  0|
|bar|  2|   6|   4|  12| 90|
|baz|  3|null|null|  20|-10|
|cab|  3|   8|null|null| 50|
+---+---+----+----+----+---+



In [17]:
from pydeequ.checks import *
from pydeequ.verification import *
check = Check(spark, CheckLevel.Warning, "Basic Check")
checkResult = VerificationSuite(spark) \
 .onData(df) \
 .addCheck(check.isComplete('a'))\
 .run()
checkResult_df = VerificationResult.checkResultsAsDataFrame(spark, checkResult)
checkResult_df.show(truncate=False)

+-----------+-----------+------------+-------------------------------------------------+-----------------+------------------+
|check      |check_level|check_status|constraint                                       |constraint_status|constraint_message|
+-----------+-----------+------------+-------------------------------------------------+-----------------+------------------+
+-----------+-----------+------------+-------------------------------------------------+-----------------+------------------+



In [18]:
df.show()

+---+---+----+----+----+---+
|  a|  b|   c|   d|   e|  f|
+---+---+----+----+----+---+
|foo|  1|   5|  10|null|  0|
|bar|  2|   6|   4|  12| 90|
|baz|  3|null|null|  20|-10|
|cab|  3|   8|null|null| 50|
+---+---+----+----+----+---+



In [19]:
from pydeequ.checks import *
from pydeequ.verification import *
check = Check(spark, CheckLevel.Warning, "Basic Check")
checkResult = VerificationSuite(spark) \
 .onData(df) \
 .addCheck(check.isComplete('c'))\
 .run()
checkResult_df = VerificationResult.checkResultsAsDataFrame(spark, checkResult)
checkResult_df.show(truncate=False)

+-----------+-----------+------------+-------------------------------------------------+-----------------+-----------------------------------------------------+
|check      |check_level|check_status|constraint                                       |constraint_status|constraint_message                                   |
+-----------+-----------+------------+-------------------------------------------------+-----------------+-----------------------------------------------------+
+-----------+-----------+------------+-------------------------------------------------+-----------------+-----------------------------------------------------+



In [20]:
df.show()

+---+---+----+----+----+---+
|  a|  b|   c|   d|   e|  f|
+---+---+----+----+----+---+
|foo|  1|   5|  10|null|  0|
|bar|  2|   6|   4|  12| 90|
|baz|  3|null|null|  20|-10|
|cab|  3|   8|null|null| 50|
+---+---+----+----+----+---+



In [21]:
from pydeequ.checks import *
from pydeequ.verification import *
check = Check(spark, CheckLevel.Warning, "Basic Check")
checkResult = VerificationSuite(spark) \
 .onData(df) \
 .addCheck(check.isComplete('d'))\
 .run()
checkResult_df = VerificationResult.checkResultsAsDataFrame(spark, checkResult)
checkResult_df.show(truncate=False)

+-----------+-----------+------------+-------------------------------------------------+-----------------+----------------------------------------------------+
|check      |check_level|check_status|constraint                                       |constraint_status|constraint_message                                  |
+-----------+-----------+------------+-------------------------------------------------+-----------------+----------------------------------------------------+
+-----------+-----------+------------+-------------------------------------------------+-----------------+----------------------------------------------------+



**areComplete**(columns, hint=None)--

Creates a constraint that asserts completeness across a combined set of columns: every listed column must be non-null in every row.
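The combined check can be sketched in plain Python as the fraction of rows in which all listed columns are non-null. This is an illustration under that assumption, with hypothetical helper names, using the sample rows from this notebook.

```python
# Plain-Python sketch of combined completeness; no Spark required.
# `combined_completeness` and `are_complete` are hypothetical helpers.
rows = [
    {"a": "foo", "b": 1, "c": 5, "d": 10},
    {"a": "bar", "b": 2, "c": 6, "d": 4},
    {"a": "baz", "b": 3, "c": None, "d": None},
    {"a": "cab", "b": 3, "c": 8, "d": None},
]

def combined_completeness(rows, cols):
    """Fraction of rows where every listed column is non-null."""
    ok = sum(all(row[c] is not None for c in cols) for row in rows)
    return ok / len(rows)

def are_complete(rows, cols):
    """Pass only when all listed columns are non-null in every row."""
    return combined_completeness(rows, cols) == 1.0

for cols in (["a", "b"], ["a", "c"], ["a", "d"]):
    print(cols, combined_completeness(rows, cols), are_complete(rows, cols))
```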

In [22]:
df.show()

+---+---+----+----+----+---+
|  a|  b|   c|   d|   e|  f|
+---+---+----+----+----+---+
|foo|  1|   5|  10|null|  0|
|bar|  2|   6|   4|  12| 90|
|baz|  3|null|null|  20|-10|
|cab|  3|   8|null|null| 50|
+---+---+----+----+----+---+



In [23]:
from pydeequ.checks import *
from pydeequ.verification import *
check = Check(spark, CheckLevel.Warning, "Basic Check")
checkResult = VerificationSuite(spark) \
 .onData(df) \
 .addCheck(check.areComplete(['a','b']))\
 .run()
checkResult_df = VerificationResult.checkResultsAsDataFrame(spark, checkResult)
checkResult_df.show(truncate=False)

+-----------+-----------+------------+------------------------------------------------------------------------------------------------------------------+-----------------+------------------+
|check      |check_level|check_status|constraint                                                                                                        |constraint_status|constraint_message|
+-----------+-----------+------------+------------------------------------------------------------------------------------------------------------------+-----------------+------------------+
+-----------+-----------+------------+------------------------------------------------------------------------------------------------------------------+-----------------+------------------+



In [24]:
df.show()

+---+---+----+----+----+---+
|  a|  b|   c|   d|   e|  f|
+---+---+----+----+----+---+
|foo|  1|   5|  10|null|  0|
|bar|  2|   6|   4|  12| 90|
|baz|  3|null|null|  20|-10|
|cab|  3|   8|null|null| 50|
+---+---+----+----+----+---+



In [25]:
from pydeequ.checks import *
from pydeequ.verification import *
check = Check(spark, CheckLevel.Warning, "Basic Check")
checkResult = VerificationSuite(spark) \
 .onData(df) \
 .addCheck(check.areComplete(['a','c']))\
 .run()
checkResult_df = VerificationResult.checkResultsAsDataFrame(spark, checkResult)
checkResult_df.show(truncate=False)

+-----------+-----------+------------+------------------------------------------------------------------------------------------------------------------+-----------------+-----------------------------------------------------+
|check      |check_level|check_status|constraint                                                                                                        |constraint_status|constraint_message                                   |
+-----------+-----------+------------+------------------------------------------------------------------------------------------------------------------+-----------------+-----------------------------------------------------+
+-----------+-----------+------------+------------------------------------------------------------------------------------------------------------------+-----------------+-----------------------------------------------------+



In [26]:
df.show()

+---+---+----+----+----+---+
|  a|  b|   c|   d|   e|  f|
+---+---+----+----+----+---+
|foo|  1|   5|  10|null|  0|
|bar|  2|   6|   4|  12| 90|
|baz|  3|null|null|  20|-10|
|cab|  3|   8|null|null| 50|
+---+---+----+----+----+---+



In [27]:
from pydeequ.checks import *
from pydeequ.verification import *
check = Check(spark, CheckLevel.Warning, "Basic Check")
checkResult = VerificationSuite(spark) \
 .onData(df) \
 .addCheck(check.areComplete(['a','d']))\
 .run()
checkResult_df = VerificationResult.checkResultsAsDataFrame(spark, checkResult)
checkResult_df.show(truncate=False)

+-----------+-----------+------------+------------------------------------------------------------------------------------------------------------------+-----------------+----------------------------------------------------+
|check      |check_level|check_status|constraint                                                                                                        |constraint_status|constraint_message                                  |
+-----------+-----------+------------+------------------------------------------------------------------------------------------------------------------+-----------------+----------------------------------------------------+
+-----------+-----------+------------+------------------------------------------------------------------------------------------------------------------+-----------------+----------------------------------------------------+



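**hasCompleteness**(column, assertion, hint=None)--

Creates a constraint that asserts on the completeness of a column, i.e. the fraction of its values that are non-null. The assertion (a lambda) receives that fraction.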
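Unlike a strict completeness check, this one lets the caller choose the threshold. A minimal plain-Python sketch, assuming the assertion receives the non-null fraction of the column (`has_completeness` is a hypothetical helper, not the PyDeequ method):

```python
# Plain-Python sketch of a thresholded completeness check; no Spark required.
# `has_completeness` is a hypothetical helper for illustration.
columns = {
    "a": ["foo", "bar", "baz", "cab"],
    "c": [5, 6, None, 8],
    "d": [10, 4, None, None],
}

def has_completeness(values, assertion):
    """Apply the assertion to the fraction of non-null values."""
    fraction = sum(v is not None for v in values) / len(values)
    return bool(assertion(fraction))

print(has_completeness(columns["a"], lambda x: x >= 0.8))  # True: 1.0 >= 0.8
print(has_completeness(columns["c"], lambda x: x >= 0.8))  # False: 0.75 < 0.8
print(has_completeness(columns["d"], lambda x: x >= 0.5))  # True: 0.5 >= 0.5
```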

In [28]:
df.show()

+---+---+----+----+----+---+
|  a|  b|   c|   d|   e|  f|
+---+---+----+----+----+---+
|foo|  1|   5|  10|null|  0|
|bar|  2|   6|   4|  12| 90|
|baz|  3|null|null|  20|-10|
|cab|  3|   8|null|null| 50|
+---+---+----+----+----+---+



In [29]:
from pydeequ.checks import *
from pydeequ.verification import *
check = Check(spark, CheckLevel.Warning, "Basic Check")
checkResult = VerificationSuite(spark) \
 .onData(df) \
 .addCheck(check.hasCompleteness('a',lambda x: x >= 0.8))\
 .run()
checkResult_df = VerificationResult.checkResultsAsDataFrame(spark, checkResult)
checkResult_df.show(truncate=False)

+-----------+-----------+------------+-------------------------------------------------+-----------------+------------------+
|check      |check_level|check_status|constraint                                       |constraint_status|constraint_message|
+-----------+-----------+------------+-------------------------------------------------+-----------------+------------------+
+-----------+-----------+------------+-------------------------------------------------+-----------------+------------------+



In [30]:
df.show()

+---+---+----+----+----+---+
|  a|  b|   c|   d|   e|  f|
+---+---+----+----+----+---+
|foo|  1|   5|  10|null|  0|
|bar|  2|   6|   4|  12| 90|
|baz|  3|null|null|  20|-10|
|cab|  3|   8|null|null| 50|
+---+---+----+----+----+---+



In [31]:
from pydeequ.checks import *
from pydeequ.verification import *
check = Check(spark, CheckLevel.Warning, "Basic Check")
checkResult = VerificationSuite(spark) \
 .onData(df) \
 .addCheck(check.hasCompleteness('c',lambda x: x >= 0.8))\
 .run()
checkResult_df = VerificationResult.checkResultsAsDataFrame(spark, checkResult)
checkResult_df.show(truncate=False)

+-----------+-----------+------------+-------------------------------------------------+-----------------+-----------------------------------------------------+
|check      |check_level|check_status|constraint                                       |constraint_status|constraint_message                                   |
+-----------+-----------+------------+-------------------------------------------------+-----------------+-----------------------------------------------------+
+-----------+-----------+------------+-------------------------------------------------+-----------------+-----------------------------------------------------+



In [32]:
df.show()

+---+---+----+----+----+---+
|  a|  b|   c|   d|   e|  f|
+---+---+----+----+----+---+
|foo|  1|   5|  10|null|  0|
|bar|  2|   6|   4|  12| 90|
|baz|  3|null|null|  20|-10|
|cab|  3|   8|null|null| 50|
+---+---+----+----+----+---+



In [33]:
from pydeequ.checks import *
from pydeequ.verification import *
check = Check(spark, CheckLevel.Warning, "Basic Check")
checkResult = VerificationSuite(spark) \
 .onData(df) \
 .addCheck(check.hasCompleteness('d',lambda x: x >= 0.5))\
 .run()
checkResult_df = VerificationResult.checkResultsAsDataFrame(spark, checkResult)
checkResult_df.show(truncate=False)

+-----------+-----------+------------+-------------------------------------------------+-----------------+------------------+
|check      |check_level|check_status|constraint                                       |constraint_status|constraint_message|
+-----------+-----------+------------+-------------------------------------------------+-----------------+------------------+
+-----------+-----------+------------+-------------------------------------------------+-----------------+------------------+



**areAnyComplete**(columns, hint=None)--

Creates a constraint that asserts that every row has at least one non-null value among the given set of columns.
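A plain-Python sketch of this check follows, assuming it passes when each row has at least one non-null value among the listed columns. The helper `are_any_complete` is hypothetical, and the `['d', 'e']` case is an extra illustration that is not run elsewhere in this notebook.

```python
# Plain-Python sketch of "any completeness"; no Spark required.
# `are_any_complete` is a hypothetical helper for illustration.
rows = [
    {"a": "foo", "c": 5, "d": 10, "e": None},
    {"a": "bar", "c": 6, "d": 4, "e": 12},
    {"a": "baz", "c": None, "d": None, "e": 20},
    {"a": "cab", "c": 8, "d": None, "e": None},
]

def are_any_complete(rows, cols):
    """Pass when every row has a non-null value in at least one listed column."""
    return all(any(row[c] is not None for c in cols) for row in rows)

print(are_any_complete(rows, ["a", "d"]))  # True: column a is never null
print(are_any_complete(rows, ["c", "e"]))  # True: each row has c or e
print(are_any_complete(rows, ["d", "e"]))  # False: the "cab" row has neither
```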

In [36]:
df.show()

+---+---+----+----+----+---+
|  a|  b|   c|   d|   e|  f|
+---+---+----+----+----+---+
|foo|  1|   5|  10|null|  0|
|bar|  2|   6|   4|  12| 90|
|baz|  3|null|null|  20|-10|
|cab|  3|   8|null|null| 50|
+---+---+----+----+----+---+



In [37]:
from pydeequ.checks import *
from pydeequ.verification import *
check = Check(spark, CheckLevel.Warning, "Basic Check")
checkResult = VerificationSuite(spark) \
 .onData(df) \
 .addCheck(check.areAnyComplete(['a','d']))\
 .run()
checkResult_df = VerificationResult.checkResultsAsDataFrame(spark, checkResult)
checkResult_df.show(truncate=False)

+-----------+-----------+------------+------------------------------------------------------------------------------------------------------------+-----------------+------------------+
|check      |check_level|check_status|constraint                                                                                                  |constraint_status|constraint_message|
+-----------+-----------+------------+------------------------------------------------------------------------------------------------------------+-----------------+------------------+
+-----------+-----------+------------+------------------------------------------------------------------------------------------------------------+-----------------+------------------+



In [38]:
df.show()

+---+---+----+----+----+---+
|  a|  b|   c|   d|   e|  f|
+---+---+----+----+----+---+
|foo|  1|   5|  10|null|  0|
|bar|  2|   6|   4|  12| 90|
|baz|  3|null|null|  20|-10|
|cab|  3|   8|null|null| 50|
+---+---+----+----+----+---+



In [42]:
from pydeequ.checks import *
from pydeequ.verification import *
check = Check(spark, CheckLevel.Warning, "Basic Check")
checkResult = VerificationSuite(spark) \
 .onData(df) \
 .addCheck(check.areAnyComplete(['c','d','a']))\
 .run()
checkResult_df = VerificationResult.checkResultsAsDataFrame(spark, checkResult)
checkResult_df.show(truncate=False)

+-----------+-----------+------------+------------------------------------------------------------------------------------------------------------------------------------+-----------------+------------------+
|check      |check_level|check_status|constraint                                                                                                                          |constraint_status|constraint_message|
+-----------+-----------+------------+------------------------------------------------------------------------------------------------------------------------------------+-----------------+------------------+
+-----------+-----------+------------+------------------------------------------------------------------------------------------------------------------------------------+-----------------+------------------+



In [40]:
df.show()

+---+---+----+----+----+---+
|  a|  b|   c|   d|   e|  f|
+---+---+----+----+----+---+
|foo|  1|   5|  10|null|  0|
|bar|  2|   6|   4|  12| 90|
|baz|  3|null|null|  20|-10|
|cab|  3|   8|null|null| 50|
+---+---+----+----+----+---+



In [41]:
from pydeequ.checks import *
from pydeequ.verification import *
check = Check(spark, CheckLevel.Warning, "Basic Check")
checkResult = VerificationSuite(spark) \
 .onData(df) \
 .addCheck(check.areAnyComplete(['c','e']))\
 .run()
checkResult_df = VerificationResult.checkResultsAsDataFrame(spark, checkResult)
checkResult_df.show(truncate=False)

+-----------+-----------+------------+------------------------------------------------------------------------------------------------------------+-----------------+------------------+
|check      |check_level|check_status|constraint                                                                                                  |constraint_status|constraint_message|
+-----------+-----------+------------+------------------------------------------------------------------------------------------------------------+-----------------+------------------+
+-----------+-----------+------------+------------------------------------------------------------------------------------------------------------+-----------------+------------------+



# **Duplicates Validation**

**isUnique**(column, hint=None)--

Creates a constraint that asserts that the values of a column are unique.
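The underlying metric can be sketched in plain Python as the fraction of values that occur exactly once. This sketch assumes null values are ignored when counting, which is an assumption about the library's null handling rather than something this notebook states; `uniqueness` is a hypothetical helper.

```python
# Plain-Python sketch of the uniqueness metric; no Spark required.
# Assumption: nulls are excluded before counting. `uniqueness` is hypothetical.
from collections import Counter

columns = {
    "a": ["foo", "bar", "baz", "cab"],
    "b": [1, 2, 3, 3],
    "d": [10, 4, None, None],
}

def uniqueness(values):
    """Fraction of non-null values that occur exactly once."""
    non_null = [v for v in values if v is not None]
    counts = Counter(non_null)
    return sum(1 for n in counts.values() if n == 1) / len(non_null)

for name, values in columns.items():
    # A column is treated as unique when the metric equals 1.0.
    print(name, uniqueness(values), uniqueness(values) == 1.0)
```

Column `b` contains the value 3 twice, so only two of its four values occur once and the metric drops to 0.5.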


In [13]:
df.show()

+---+---+----+----+----+---+
|  a|  b|   c|   d|   e|  f|
+---+---+----+----+----+---+
|foo|  1|   5|  10|null|  0|
|bar|  2|   6|   4|  12| 90|
|baz|  3|null|null|  20|-10|
|cab|  3|   8|null|null| 50|
+---+---+----+----+----+---+



In [14]:
from pydeequ.checks import *
from pydeequ.verification import *
check = Check(spark, CheckLevel.Warning, "Basic Check")
checkResult = VerificationSuite(spark) \
 .onData(df) \
 .addCheck(check.isUnique('a'))\
 .run()
checkResult_df = VerificationResult.checkResultsAsDataFrame(spark, checkResult)
checkResult_df.show(truncate=False)

+-----------+-----------+------------+---------------------------------------------------+-----------------+------------------+
|check      |check_level|check_status|constraint                                         |constraint_status|constraint_message|
+-----------+-----------+------------+---------------------------------------------------+-----------------+------------------+
+-----------+-----------+------------+---------------------------------------------------+-----------------+------------------+



In [15]:
df.show()

+---+---+----+----+----+---+
|  a|  b|   c|   d|   e|  f|
+---+---+----+----+----+---+
|foo|  1|   5|  10|null|  0|
|bar|  2|   6|   4|  12| 90|
|baz|  3|null|null|  20|-10|
|cab|  3|   8|null|null| 50|
+---+---+----+----+----+---+



In [16]:
from pydeequ.checks import *
from pydeequ.verification import *
check = Check(spark, CheckLevel.Warning, "Basic Check")
checkResult = VerificationSuite(spark) \
 .onData(df) \
 .addCheck(check.isUnique('b'))\
 .run()
checkResult_df = VerificationResult.checkResultsAsDataFrame(spark, checkResult)
checkResult_df.show(truncate=False)

+-----------+-----------+------------+---------------------------------------------------+-----------------+----------------------------------------------------+
|check      |check_level|check_status|constraint                                         |constraint_status|constraint_message                                  |
+-----------+-----------+------------+---------------------------------------------------+-----------------+----------------------------------------------------+
+-----------+-----------+------------+---------------------------------------------------+-----------------+----------------------------------------------------+



In [17]:
df.show()

+---+---+----+----+----+---+
|  a|  b|   c|   d|   e|  f|
+---+---+----+----+----+---+
|foo|  1|   5|  10|null|  0|
|bar|  2|   6|   4|  12| 90|
|baz|  3|null|null|  20|-10|
|cab|  3|   8|null|null| 50|
+---+---+----+----+----+---+



In [18]:
from pydeequ.checks import *
from pydeequ.verification import *
check = Check(spark, CheckLevel.Warning, "Basic Check")
checkResult = VerificationSuite(spark) \
 .onData(df) \
 .addCheck(check.isUnique('d'))\
 .run()
checkResult_df = VerificationResult.checkResultsAsDataFrame(spark, checkResult)
checkResult_df.show(truncate=False)

+-----------+-----------+------------+---------------------------------------------------+-----------------+------------------+
|check      |check_level|check_status|constraint                                         |constraint_status|constraint_message|
+-----------+-----------+------------+---------------------------------------------------+-----------------+------------------+
+-----------+-----------+------------+---------------------------------------------------+-----------------+------------------+



**hasUniqueness**(columns, assertion, hint=None)--

Creates a constraint that asserts on the uniqueness of a single column or a combined set of key columns. The assertion (a lambda) receives the uniqueness ratio.
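For a combined key, the same idea applies to tuples of values. The following plain-Python sketch uses only the null-free columns `a` and `b`, so it makes no claim about how nulls in key columns are treated; `has_uniqueness` is a hypothetical helper.

```python
# Plain-Python sketch of a thresholded uniqueness check over key columns.
# `has_uniqueness` is a hypothetical helper; only null-free columns are used.
from collections import Counter

rows = [
    {"a": "foo", "b": 1},
    {"a": "bar", "b": 2},
    {"a": "baz", "b": 3},
    {"a": "cab", "b": 3},
]

def has_uniqueness(rows, cols, assertion):
    """Apply the assertion to the fraction of key tuples that occur exactly once."""
    keys = [tuple(row[c] for c in cols) for row in rows]
    counts = Counter(keys)
    ratio = sum(1 for n in counts.values() if n == 1) / len(keys)
    return bool(assertion(ratio))

print(has_uniqueness(rows, ["a"], lambda x: x > 0.75))       # True: ratio is 1.0
print(has_uniqueness(rows, ["b"], lambda x: x > 0.75))       # False: ratio is 0.5
print(has_uniqueness(rows, ["a", "b"], lambda x: x > 0.75))  # True: all pairs differ
```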


In [19]:
df.show()

+---+---+----+----+----+---+
|  a|  b|   c|   d|   e|  f|
+---+---+----+----+----+---+
|foo|  1|   5|  10|null|  0|
|bar|  2|   6|   4|  12| 90|
|baz|  3|null|null|  20|-10|
|cab|  3|   8|null|null| 50|
+---+---+----+----+----+---+



In [20]:
from pydeequ.checks import *
from pydeequ.verification import *
check = Check(spark, CheckLevel.Warning, "Basic Check")
checkResult = VerificationSuite(spark) \
 .onData(df) \
 .addCheck(check.hasUniqueness(['a'],lambda x : x > 0.75))\
 .run()
checkResult_df = VerificationResult.checkResultsAsDataFrame(spark, checkResult)
checkResult_df.show(truncate=False)

Python Callback server started!
+-----------+-----------+------------+--------------------------------------------------------+-----------------+------------------+
|check      |check_level|check_status|constraint                                              |constraint_status|constraint_message|
+-----------+-----------+------------+--------------------------------------------------------+-----------------+------------------+
+-----------+-----------+------------+--------------------------------------------------------+-----------------+------------------+



In [21]:
df.show()

+---+---+----+----+----+---+
|  a|  b|   c|   d|   e|  f|
+---+---+----+----+----+---+
|foo|  1|   5|  10|null|  0|
|bar|  2|   6|   4|  12| 90|
|baz|  3|null|null|  20|-10|
|cab|  3|   8|null|null| 50|
+---+---+----+----+----+---+



In [22]:
from pydeequ.checks import *
from pydeequ.verification import *
check = Check(spark, CheckLevel.Warning, "Basic Check")
checkResult = VerificationSuite(spark) \
 .onData(df) \
 .addCheck(check.hasUniqueness(['b'],lambda x : x > 0.75))\
 .run()
checkResult_df = VerificationResult.checkResultsAsDataFrame(spark, checkResult)
checkResult_df.show(truncate=False)

+-----------+-----------+------------+--------------------------------------------------------+-----------------+----------------------------------------------------+
|check      |check_level|check_status|constraint                                              |constraint_status|constraint_message                                  |
+-----------+-----------+------------+--------------------------------------------------------+-----------------+----------------------------------------------------+
+-----------+-----------+------------+--------------------------------------------------------+-----------------+----------------------------------------------------+



**hasMin**(column, assertion, hint=None)--

Creates a constraint that asserts on the minimum of a column.


In [23]:
df.show()

+---+---+----+----+----+---+
|  a|  b|   c|   d|   e|  f|
+---+---+----+----+----+---+
|foo|  1|   5|  10|null|  0|
|bar|  2|   6|   4|  12| 90|
|baz|  3|null|null|  20|-10|
|cab|  3|   8|null|null| 50|
+---+---+----+----+----+---+



In [25]:
from pydeequ.checks import *
from pydeequ.verification import *
check = Check(spark, CheckLevel.Warning, "Basic Check")
checkResult = VerificationSuite(spark) \
 .onData(df) \
 .addCheck(check.hasMin('b',lambda x : x <= 2))\
 .run()
checkResult_df = VerificationResult.checkResultsAsDataFrame(spark, checkResult)
checkResult_df.show(truncate=False)

+-----------+-----------+------------+---------------------------------------+-----------------+------------------+
|check      |check_level|check_status|constraint                             |constraint_status|constraint_message|
+-----------+-----------+------------+---------------------------------------+-----------------+------------------+
+-----------+-----------+------------+---------------------------------------+-----------------+------------------+



**hasMax**(column, assertion, hint=None)--

Creates a constraint that asserts on the maximum of the column

**hasMean**(column, assertion, hint=None)--

Creates a constraint that asserts on the mean of the column

**hasSum**(column, assertion, hint=None)--

Creates a constraint that asserts on the sum of the column
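To see which of these assertions should hold on the sample data, the aggregates can be worked out in plain Python. The sketch assumes null values are skipped, as Spark aggregate functions do; the dictionary keys are descriptive labels only, not PyDeequ output.

```python
# Plain-Python sketch of min/max/mean/sum assertions; no Spark required.
# Nulls are skipped before aggregating, mirroring Spark's aggregate behaviour.
from statistics import mean

b = [1, 2, 3, 3]
c = [5, 6, None, 8]
d = [10, 4, None, None]

def non_null(values):
    """Drop null (None) entries before aggregating."""
    return [v for v in values if v is not None]

results = {
    "min(b) == 1": min(non_null(b)) == 1,    # min is 1
    "max(c) == 9": max(non_null(c)) == 9,    # max is 8
    "mean(d) == 2": mean(non_null(d)) == 2,  # mean of 10 and 4 is 7
    "sum(b) == 9": sum(non_null(b)) == 9,    # 1 + 2 + 3 + 3 = 9
}

for label, passed in results.items():
    print(label, "->", "Success" if passed else "Failure")
```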


In [26]:
df.show()

+---+---+----+----+----+---+
|  a|  b|   c|   d|   e|  f|
+---+---+----+----+----+---+
|foo|  1|   5|  10|null|  0|
|bar|  2|   6|   4|  12| 90|
|baz|  3|null|null|  20|-10|
|cab|  3|   8|null|null| 50|
+---+---+----+----+----+---+



In [27]:
from pydeequ.checks import *
from pydeequ.verification import *
check = Check(spark, CheckLevel.Warning, "Basic Check")
checkResult = VerificationSuite(spark) \
 .onData(df) \
 .addCheck(check.hasMin('b',lambda x : x == 1)\
 .hasMax('c',lambda x : x == 9)\
 .hasMean('d',lambda x : x == 2)\
 .hasSum('b',lambda x : x == 9)) \
 .run()
checkResult_df = VerificationResult.checkResultsAsDataFrame(spark, checkResult)
checkResult_df.show(truncate=False)

+-----------+-----------+------------+---------------------------------------+-----------------+----------------------------------------------------+
|check      |check_level|check_status|constraint                             |constraint_status|constraint_message                                  |
+-----------+-----------+------------+---------------------------------------+-----------------+----------------------------------------------------+
+-----------+-----------+------------+---------------------------------------+-----------------+----------------------------------------------------+



**isNonNegative**(column, assertion=None, hint=None)--

Creates a constraint which asserts that a column contains no negative values.

**isPositive**(column, assertion=None, hint=None)--

Creates a constraint which asserts that a column contains only values greater than 0.
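The difference between the two checks is only how zero is treated. A minimal plain-Python sketch, assuming the default behaviour (no custom assertion) requires every value to satisfy the condition; the helper names are hypothetical, and the example values contain no nulls, so null handling is not covered here.

```python
# Plain-Python sketch of the sign checks; no Spark required.
# `is_non_negative` and `is_positive` are hypothetical helpers.
columns = {
    "a": [1, 0, 2, 3],
    "b": [1, -2, 3, 3],
}

def is_non_negative(values):
    """Pass when every value is >= 0 (zero is allowed)."""
    return all(v >= 0 for v in values)

def is_positive(values):
    """Pass when every value is > 0 (zero is not allowed)."""
    return all(v > 0 for v in values)

for name, values in columns.items():
    print(name, "non-negative:", is_non_negative(values), "positive:", is_positive(values))
```

Column `a` contains a zero, so it is non-negative but not positive; column `b` contains -2 and fails both.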

In [28]:
df1 = spark.sparkContext.parallelize([
    Row(a=1, b=1),
    Row(a=0, b=-2),
    Row(a=2, b=3),
    Row(a=3, b=3)]).toDF()
df1.show()

+---+---+
|  a|  b|
+---+---+
|  1|  1|
|  0| -2|
|  2|  3|
|  3|  3|
+---+---+



In [29]:

from pydeequ.checks import *
from pydeequ.verification import *
check = Check(spark, CheckLevel.Warning, "Basic Check")
checkResult = VerificationSuite(spark) \
 .onData(df1) \
 .addCheck(check.isPositive('a').isNonNegative('a').isPositive('b').isNonNegative('b')) \
 .run()
checkResult_df = VerificationResult.checkResultsAsDataFrame(spark, checkResult)
checkResult_df.show(truncate=False)

+-----------+-----------+------------+-------------------------------------------------------------------------------------------------------------------+-----------------+-----------------------------------------------------+
|check      |check_level|check_status|constraint                                                                                                         |constraint_status|constraint_message                                   |
+-----------+-----------+------------+-------------------------------------------------------------------------------------------------------------------+-----------------+-----------------------------------------------------+
+-----------+-----------+------------+-------------------------------------------------------------------------------------------------------------------+-----------------+-----------------------------------------------------+



**isContainedIn**(column, allowed_values, assertion=None, hint=None)--

Asserts that every non-null value in a column is contained in a set of predefined values.
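
A minimal plain-Python sketch of that rule, using the values of column `a` from the cells below. It is only an illustration of "non-null values must be in the set"; the helper name `is_contained_in` is hypothetical and the code does not call PyDeequ.

```python
# Plain-Python sketch (not PyDeequ): nulls are skipped, every other value
# must be a member of the allowed set. The helper name is hypothetical.
def is_contained_in(values, allowed_values):
    allowed = set(allowed_values)
    return all(v in allowed for v in values if v is not None)


column_a = ["foo", "bar", "baz", "cab"]

# "cab" is missing from the allowed set, so the rule is violated.
print(is_contained_in(column_a, ["foo", "bar", "baz"]))  # False

# All four values are allowed.
print(is_contained_in(column_a, ["foo", "bar", "baz", "cab"]))  # True

# An extra null row does not change the outcome.
print(is_contained_in(column_a + [None], ["foo", "bar", "baz", "cab"]))  # True
```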



In [30]:
df.show()

+---+---+----+----+----+---+
|  a|  b|   c|   d|   e|  f|
+---+---+----+----+----+---+
|foo|  1|   5|  10|null|  0|
|bar|  2|   6|   4|  12| 90|
|baz|  3|null|null|  20|-10|
|cab|  3|   8|null|null| 50|
+---+---+----+----+----+---+



In [31]:
from pydeequ.checks import *
from pydeequ.verification import *
check = Check(spark, CheckLevel.Warning, "Basic Check")
checkResult = VerificationSuite(spark) \
 .onData(df) \
 .addCheck(check.isContainedIn("a", ["foo", "bar", "baz"]))\
 .run()
checkResult_df = VerificationResult.checkResultsAsDataFrame(spark, checkResult)
checkResult_df.show(truncate=False)

+-----------+-----------+------------+------------------------------------------------------------------------------------------------------------------------+-----------------+-----------------------------------------------------+
|check      |check_level|check_status|constraint                                                                                                              |constraint_status|constraint_message                                   |
+-----------+-----------+------------+------------------------------------------------------------------------------------------------------------------------+-----------------+-----------------------------------------------------+
+-----------+-----------+------------+------------------------------------------------------------------------------------------------------------------------+-----------------+-----------------------------------------------------+



In [32]:
df.show()

+---+---+----+----+----+---+
|  a|  b|   c|   d|   e|  f|
+---+---+----+----+----+---+
|foo|  1|   5|  10|null|  0|
|bar|  2|   6|   4|  12| 90|
|baz|  3|null|null|  20|-10|
|cab|  3|   8|null|null| 50|
+---+---+----+----+----+---+



In [33]:
from pydeequ.checks import *
from pydeequ.verification import *
check = Check(spark, CheckLevel.Warning, "Basic Check")
checkResult = VerificationSuite(spark) \
 .onData(df) \
 .addCheck(check.isContainedIn("a", ["foo", "bar", "baz", "cab"]))\
 .run()
checkResult_df = VerificationResult.checkResultsAsDataFrame(spark, checkResult)
checkResult_df.show(truncate=False)

+-----------+-----------+------------+----------------------------------------------------------------------------------------------------------------------------------+-----------------+------------------+
|check      |check_level|check_status|constraint                                                                                                                        |constraint_status|constraint_message|
+-----------+-----------+------------+----------------------------------------------------------------------------------------------------------------------------------+-----------------+------------------+
+-----------+-----------+------------+----------------------------------------------------------------------------------------------------------------------------------+-----------------+------------------+



In [36]:
df2 = spark.sparkContext.parallelize([
    Row(a="foo", b=1, c=5, d=10, e=None, f=0),
    Row(a="bar", b=2, c=6, d=4, e= 12, f=90),
    Row(a="baz", b=3, c=None, d=None, e = 20, f= -10),
    Row(a="cab", b=3, c=8,  d=None, e =None, f=50),
    Row(a=None, b=3, c=8,  d=None, e =None, f=50)]).toDF()
df2.show()

+----+---+----+----+----+---+
|   a|  b|   c|   d|   e|  f|
+----+---+----+----+----+---+
| foo|  1|   5|  10|null|  0|
| bar|  2|   6|   4|  12| 90|
| baz|  3|null|null|  20|-10|
| cab|  3|   8|null|null| 50|
|null|  3|   8|null|null| 50|
+----+---+----+----+----+---+



In [37]:
from pydeequ.checks import *
from pydeequ.verification import *
check = Check(spark, CheckLevel.Warning, "Basic Check")
checkResult = VerificationSuite(spark) \
 .onData(df2) \
 .addCheck(check.isContainedIn("a", ["foo", "bar", "baz", "cab"]))\
 .run()
checkResult_df = VerificationResult.checkResultsAsDataFrame(spark, checkResult)
checkResult_df.show(truncate=False)

+-----------+-----------+------------+----------------------------------------------------------------------------------------------------------------------------------+-----------------+------------------+
|check      |check_level|check_status|constraint                                                                                                                        |constraint_status|constraint_message|
+-----------+-----------+------------+----------------------------------------------------------------------------------------------------------------------------------+-----------------+------------------+
+-----------+-----------+------------+----------------------------------------------------------------------------------------------------------------------------------+-----------------+------------------+



**containsURL**(column, assertion=None, hint=None)--

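Checks the values of a column against a URL pattern. When an assertion is given, it is applied to the fraction of rows that match.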
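
To make the role of the assertion concrete, here is a rough plain-Python sketch. It rests on two assumptions: the regular expression is a simplified stand-in rather than the pattern the library uses, and null rows are counted as non-matching. The helper name `url_fraction` is hypothetical.

```python
import re

# Simplified URL pattern for illustration only; the library's own pattern
# is more elaborate (assumption).
URL_PATTERN = re.compile(r"https?://\S+")


def url_fraction(values):
    """Fraction of all rows whose text contains a URL.

    Null rows count as non-matching here, which is an assumption.
    """
    hits = sum(1 for v in values if v is not None and URL_PATTERN.search(v))
    return hits / len(values)


descriptions = [
    "awesome thing.",
    "available at http://producta.example.com",
    None,
    "checkout https://productd.example.org",
    None,
]

fraction = url_fraction(descriptions)
assertion = lambda x: x >= 0.3  # same shape as the assertion passed to the check

print(fraction, assertion(fraction))  # 0.4 True
```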



In [49]:
df3 = spark.createDataFrame([
 (1, "Product A", "awesome thing.", "high", 2),
 (2, "Product B", "available at http://producta.example.com", None, 0),
 (3, None, None, "medium", 6),
 (4, "Product D", "checkout https://productd.example.org", "low", 10),
 (5, "Product E", None, "high", 18)],
['id', 'productName', 'description', 'priority', 'numViews'])
df3.show(truncate=False)

+---+-----------+----------------------------------------+--------+--------+
|id |productName|description                             |priority|numViews|
+---+-----------+----------------------------------------+--------+--------+
|1  |Product A  |awesome thing.                          |high    |2       |
|2  |Product B  |available at http://producta.example.com|null    |0       |
|3  |null       |null                                    |medium  |6       |
|4  |Product D  |checkout https://productd.example.org   |low     |10      |
|5  |Product E  |null                                    |high    |18      |
+---+-----------+----------------------------------------+--------+--------+



In [52]:
from pydeequ.checks import *
from pydeequ.verification import *
check = Check(spark, CheckLevel.Warning, "Basic Check")
checkResult = VerificationSuite(spark) \
 .onData(df3) \
 .addCheck(check.containsURL("description", lambda x: x >= 0.3))\
 .run()
checkResult_df = VerificationResult.checkResultsAsDataFrame(spark, checkResult)
checkResult_df.show(truncate=False)

+-----------+-----------+------------+------------------------+-----------------+------------------+
|check      |check_level|check_status|constraint              |constraint_status|constraint_message|
+-----------+-----------+------------+------------------------+-----------------+------------------+
+-----------+-----------+------------+------------------------+-----------------+------------------+



**containsEmail**(column, assertion=None, hint=None)--

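Checks the values of a column against an email pattern. When an assertion is given, it is applied to the fraction of rows that match.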
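
The two cells below apply different assertions to the same column. As a hedged plain-Python sketch of why they can disagree: the email pattern here is a simplified assumption, null rows are assumed to count as non-matching, and the names `at_least_30_percent` and `at_most_10_percent` are made up for the example.

```python
import re

# Simplified email pattern, an assumption made for this sketch only.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(\.[\w-]+)+")

values = [
    "The email address is foo@example.com",
    "Mail at bar@example.com",
    None,
    "Just use this foobar@baz.com",
]

# Count rows containing an email; nulls are treated as non-matching (assumption).
hits = sum(1 for v in values if v is not None and EMAIL_PATTERN.search(v))
fraction = hits / len(values)

at_least_30_percent = lambda x: x >= 0.3
at_most_10_percent = lambda x: x <= 0.1

print(fraction)                       # 0.75
print(at_least_30_percent(fraction))  # True
print(at_most_10_percent(fraction))   # False
```

The assertion is just a predicate over one number, so the same data can satisfy one threshold and violate another.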


In [53]:
df4 = spark.createDataFrame([
 (1, "The email address is foo@example.com"),
 (2, "Mail at bar@example.com"),
 (3, None),
 (4, "Just use this foobar@baz.com")],
['id', 'check_for_mail'])
df4.show(truncate=False)

+---+------------------------------------+
|id |check_for_mail                      |
+---+------------------------------------+
|1  |The email address is foo@example.com|
|2  |Mail at bar@example.com             |
|3  |null                                |
|4  |Just use this foobar@baz.com        |
+---+------------------------------------+



In [54]:
from pydeequ.checks import *
from pydeequ.verification import *
check = Check(spark, CheckLevel.Warning, "Basic Check")
checkResult = VerificationSuite(spark) \
 .onData(df4) \
 .addCheck(check.containsEmail("check_for_mail", lambda x: x >= 0.3))\
 .run()
checkResult_df = VerificationResult.checkResultsAsDataFrame(spark, checkResult)
checkResult_df.show(truncate=False)

+-----------+-----------+------------+-----------------------------+-----------------+------------------+
|check      |check_level|check_status|constraint                   |constraint_status|constraint_message|
+-----------+-----------+------------+-----------------------------+-----------------+------------------+
+-----------+-----------+------------+-----------------------------+-----------------+------------------+



In [55]:
df4.show(truncate=False)

+---+------------------------------------+
|id |check_for_mail                      |
+---+------------------------------------+
|1  |The email address is foo@example.com|
|2  |Mail at bar@example.com             |
|3  |null                                |
|4  |Just use this foobar@baz.com        |
+---+------------------------------------+



In [57]:
from pydeequ.checks import *
from pydeequ.verification import *
check = Check(spark, CheckLevel.Warning, "Basic Check")
checkResult = VerificationSuite(spark) \
 .onData(df4) \
 .addCheck(check.containsEmail("check_for_mail", lambda x: x <= 0.1))\
 .run()
checkResult_df = VerificationResult.checkResultsAsDataFrame(spark, checkResult)
checkResult_df.show(truncate=False)

+-----------+-----------+------------+-----------------------------+-----------------+-----------------------------------------------------+
|check      |check_level|check_status|constraint                   |constraint_status|constraint_message                                   |
+-----------+-----------+------------+-----------------------------+-----------------+-----------------------------------------------------+
+-----------+-----------+------------+-----------------------------+-----------------+-----------------------------------------------------+



**hasPattern**(column, pattern, assertion=None, name=None, hint=None)
--

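Checks the values of a column against a regular expression. When an assertion is given, it is applied to the fraction of rows that match.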
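
When choosing a pattern, it is worth checking whether it can match the empty string, as `f*` can. The sketch below uses Python's `re` module, not PyDeequ, to show how much the resulting fraction depends on whether empty matches are counted. Which rule the library applies is an assumption to verify against your version; the helper name `match_fraction` is hypothetical.

```python
import re

values = ["foo", "bar", "baz", "cab"]


def match_fraction(values, pattern, require_non_empty):
    """Fraction of values where the first regex match counts as a hit."""
    hits = 0
    for v in values:
        m = re.search(pattern, v)
        if m is None:
            continue
        # Optionally ignore matches that consumed no characters.
        if require_non_empty and m.group(0) == "":
            continue
        hits += 1
    return hits / len(values)


# "f*" matches the empty string at position 0 of every value.
print(match_fraction(values, r"f*", require_non_empty=False))  # 1.0
# Only "foo" yields a non-empty first match.
print(match_fraction(values, r"f*", require_non_empty=True))   # 0.25
# An anchored pattern that cannot match the empty string is unambiguous.
print(match_fraction(values, r"^f", require_non_empty=True))   # 0.25
```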


In [67]:
df.show()

+---+---+----+----+----+---+
|  a|  b|   c|   d|   e|  f|
+---+---+----+----+----+---+
|foo|  1|   5|  10|null|  0|
|bar|  2|   6|   4|  12| 90|
|baz|  3|null|null|  20|-10|
|cab|  3|   8|null|null| 50|
+---+---+----+----+----+---+



In [71]:
from pydeequ.checks import *
from pydeequ.verification import *
check = Check(spark, CheckLevel.Warning, "Basic Check")
checkResult = VerificationSuite(spark) \
 .onData(df) \
 .addCheck(check.hasPattern("a", r"f*", lambda x: x >= 0.5))\
 .run()
checkResult_df = VerificationResult.checkResultsAsDataFrame(spark, checkResult)
checkResult_df.show(truncate=False)

+-----------+-----------+------------+-----------------------------+-----------------+-----------------------------------------------------+
|check      |check_level|check_status|constraint                   |constraint_status|constraint_message                                   |
+-----------+-----------+------------+-----------------------------+-----------------+-----------------------------------------------------+
+-----------+-----------+------------+-----------------------------+-----------------+-----------------------------------------------------+



**isGreaterThan**(columnA, columnB, assertion=None, hint=None)

Asserts that, in each row, the value of columnA is greater than the value of columnB.
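
A plain-Python sketch of the row-wise comparison, using columns `b` and `c` from `df5` below. It assumes that a row with a null in either column does not count as satisfying the comparison; the helper name `greater_than_fraction` is hypothetical and the code does not call PyDeequ.

```python
# Plain-Python sketch (not PyDeequ) of a row-wise "columnA > columnB" check.
def greater_than_fraction(rows, column_a, column_b):
    """Fraction of rows where column_a > column_b.

    Rows with a null on either side count as not satisfied (assumption).
    """
    hits = 0
    for row in rows:
        a, b = row[column_a], row[column_b]
        if a is not None and b is not None and a > b:
            hits += 1
    return hits / len(rows)


rows = [
    {"b": 1, "c": 5},
    {"b": 2, "c": 6},
    {"b": 3, "c": None},  # null in c: comparison cannot be satisfied
    {"b": 3, "c": 8},
]

print(greater_than_fraction(rows, "c", "b"))  # 0.75
print(greater_than_fraction(rows, "b", "c"))  # 0.0
```

Under that assumption, a single null is enough to keep the fraction below 1, so a check that requires every row to pass would not succeed on this data.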

In [80]:
df5 = spark.sparkContext.parallelize([
    Row(a="foo", b=1, c=5, d=10, e=None, f=100),
    Row(a="bar", b=2, c=6, d=4, e= 12, f=90),
    Row(a="baz", b=3, c=None, d=None, e = 20, f= 10),
    Row(a="cab", b=3, c=8,  d=None, e =None, f=50)]).toDF()
df5.show()

+---+---+----+----+----+---+
|  a|  b|   c|   d|   e|  f|
+---+---+----+----+----+---+
|foo|  1|   5|  10|null|100|
|bar|  2|   6|   4|  12| 90|
|baz|  3|null|null|  20| 10|
|cab|  3|   8|null|null| 50|
+---+---+----+----+----+---+



In [86]:
from pydeequ.checks import *
from pydeequ.verification import *
check = Check(spark, CheckLevel.Warning, "Basic Check")
checkResult = VerificationSuite(spark) \
 .onData(df5) \
 .addCheck(check.isGreaterThan("c","b"))\
 .run()
checkResult_df = VerificationResult.checkResultsAsDataFrame(spark, checkResult)
checkResult_df.show(truncate=False)

+-----------+-----------+------------+--------------------------------------------------------------------------------+-----------------+-----------------------------------------------------+
|check      |check_level|check_status|constraint                                                                      |constraint_status|constraint_message                                   |
+-----------+-----------+------------+--------------------------------------------------------------------------------+-----------------+-----------------------------------------------------+
+-----------+-----------+------------+--------------------------------------------------------------------------------+-----------------+-----------------------------------------------------+



**hasDataType**(column, datatype: ConstrainableDataTypes, assertion=None, hint=None)--

Checks the fraction of rows whose values conform to the given data type. When an assertion is given, it is applied to that fraction.
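
As a rough plain-Python illustration of a "fraction conforming to a type" measure: the sketch below assumes that null values are left out of the calculation and that a value is numeric if it is a number or a string that parses as one. Both rules are assumptions for the example, not the library's definition, and the helper names are hypothetical.

```python
# Plain-Python sketch (not PyDeequ) of a "fraction of numeric values" measure.
def is_numeric(value):
    if isinstance(value, bool):
        return False  # bool is a subclass of int in Python; exclude it
    if isinstance(value, (int, float)):
        return True
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False


def numeric_fraction(values):
    """Fraction of non-null values that look numeric (nulls skipped, by assumption)."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return 0.0
    return sum(1 for v in non_null if is_numeric(v)) / len(non_null)


print(numeric_fraction([5, 6, None, 8]))             # 1.0
print(numeric_fraction(["foo", "1", "2.5", "bar"]))  # 0.5
```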

In [92]:
df5.show()

+---+---+----+----+----+---+
|  a|  b|   c|   d|   e|  f|
+---+---+----+----+----+---+
|foo|  1|   5|  10|null|100|
|bar|  2|   6|   4|  12| 90|
|baz|  3|null|null|  20| 10|
|cab|  3|   8|null|null| 50|
+---+---+----+----+----+---+



In [98]:
from pydeequ.checks import *
from pydeequ.verification import *
check = Check(spark, CheckLevel.Warning, "Basic Check")
checkResult = VerificationSuite(spark) \
 .onData(df5) \
 .addCheck(check.hasDataType("c",ConstrainableDataTypes.Numeric))\
 .run()
checkResult_df = VerificationResult.checkResultsAsDataFrame(spark, checkResult)
checkResult_df.show(truncate=False)

+-----------+-----------+------------+----------------------------------------------------------------------------+-----------------+------------------+
|check      |check_level|check_status|constraint                                                                  |constraint_status|constraint_message|
+-----------+-----------+------------+----------------------------------------------------------------------------+-----------------+------------------+
+-----------+-----------+------------+----------------------------------------------------------------------------+-----------------+------------------+

