<a href="https://colab.research.google.com/github/JonatanPolanco/Data_Quality_Testing/blob/main/PyDeequ_Learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Installing libraries**

In [None]:
!pip install pydeequ==1.0.1

Collecting pydeequ==1.0.1
  Downloading pydeequ-1.0.1-py3-none-any.whl (36 kB)
Installing collected packages: pydeequ
Successfully installed pydeequ-1.0.1


In [None]:
!pip install pyspark==3.0.3

Collecting pyspark==3.0.3
  Downloading pyspark-3.0.3.tar.gz (209.1 MB)
Collecting py4j==0.10.9
  Downloading py4j-0.10.9-py2.py3-none-any.whl (198 kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.0.3-py2.py3-none-any.whl size=209435971 sha256=a7f35f2047039dbd82ee7c6cd1416330a19f805cfd621bd01edb3b34879b52f1
  Stored in directory: /root/.cache/pip/wheels/7e/6d/0a/6b0bf301bc056d9af03194b732b9f49ad2fceb205aab2984fd
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9 pyspark-3.0.3


**Configuring the PySpark session**

In [2]:
from pyspark.sql import SparkSession, Row
import pydeequ
import pandas as pd

spark = (SparkSession
    .builder
    .config("spark.jars.packages", pydeequ.deequ_maven_coord)
    .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
    .getOrCreate())

**Loading data**

In [3]:
df = spark.sparkContext.parallelize([
            Row(a="https://www.vldb.org/pvldb/vol11/p1781-schelter.pdf", b=1, c="jobici8705@gmail"),
            Row(a="https://aws.amazon.com/blogs/big-data/test-data-quality-at-scale-with-deequ/", b=2, c="jonatan@outlook.es"),
            Row(a="https://pydeequ.readthedocs.io/_/downloads/en/latest/pdf/", b=3, c='jobici8705@')]).toDF()

**Viewing the data**

In [29]:
df__ = df.toPandas()
df__.head()

Unnamed: 0,a,b,c
0,https://www.vldb.org/pvldb/vol11/p1781-schelte...,1,jobici8705@gmail
1,https://aws.amazon.com/blogs/big-data/test-dat...,2,jonatan@outlook.es
2,https://pydeequ.readthedocs.io/_/downloads/en/...,3,jobici8705@


**AWS Deequ analyzers**

In [None]:
from pydeequ.analyzers import *

analysisResult = AnalysisRunner(spark) \
                    .onData(df) \
                    .addAnalyzer(Size()) \
                    .addAnalyzer(Completeness("b")) \
                    .addAnalyzer(Completeness("c")) \
                    .run()

analysisResult_df = AnalyzerContext.successMetricsAsDataFrame(spark, analysisResult)
analysisResult_df.show()



+-------+--------+------------+------------------+
| entity|instance|        name|             value|
+-------+--------+------------+------------------+
|Dataset|       *|        Size|               3.0|
| Column|       b|Completeness|               1.0|
| Column|       c|Completeness|0.6666666666666666|
+-------+--------+------------+------------------+
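To make these metrics concrete, here is a minimal plain-Python sketch of what `Size` and `Completeness` measure: the row count, and the fraction of rows whose value in a column is not null. It does not call Deequ. The `rows` list and the `completeness` helper are hypothetical, with one missing `c` value chosen so that two of three rows are complete.

```python
# Hypothetical sample: three rows, one of them missing a value for "c".
rows = [
    {"b": 1, "c": "x@example.com"},
    {"b": 2, "c": "y@example.com"},
    {"b": 3, "c": None},
]


def completeness(rows, column):
    """Return the fraction of rows whose value in `column` is not None."""
    if not rows:
        raise ValueError("completeness is undefined for an empty dataset")
    non_null = sum(1 for row in rows if row.get(column) is not None)
    return non_null / len(rows)


size = float(len(rows))  # the metrics table reports Size as a float
b_complete = completeness(rows, "b")
c_complete = completeness(rows, "c")
print(size, b_complete, c_complete)
```

Both metrics are simple ratios or counts, which is why the analyzer output is a single `value` per entity and instance.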



**Profiling**

In [None]:
from pydeequ.profiles import *

result = ColumnProfilerRunner(spark) \
    .onData(df) \
    .run()

for col, profile in result.profiles.items():
    print(profile)

StandardProfiles for column: a: {
    "completeness": 1.0,
    "approximateNumDistinctValues": 3,
    "dataType": "String",
    "isDataTypeInferred": false,
    "typeCounts": {
        "Boolean": 0,
        "Fractional": 0,
        "Integral": 0,
        "Unknown": 0,
        "String": 3
    },
    "histogram": [
        [
            "https://aws.amazon.com/blogs/big-data/test-data-quality-at-scale-with-deequ/",
            1,
            0.3333333333333333
        ],
        [
            "https://www.vldb.org/pvldb/vol11/p1781-schelter.pdf",
            1,
            0.3333333333333333
        ],
        [
            "baz",
            1,
            0.3333333333333333
        ]
    ]
}
NumericProfiles for column: b: {
    "completeness": 1.0,
    "approximateNumDistinctValues": 3,
    "dataType": "Integral",
    "isDataTypeInferred": false,
    "typeCounts": {},
    "histogram": [
        [
            "1",
            1,
            0.3333333333333333
        ],
        [
      

**Constraint suggestions**

In [8]:
from pydeequ.suggestions import *
import json

suggestionResult = ConstraintSuggestionRunner(spark) \
             .onData(df) \
             .addConstraintRule(DEFAULT()) \
             .run()

# The result is a dict; "constraint_suggestions" holds a list of flat records
# (column_name, description, rule_description, code_for_constraint, ...).
# print(json.dumps(suggestionResult["constraint_suggestions"], indent=2))

# Each record is already flat, so the list loads directly into a DataFrame.
# (pd.json_normalize with record_path='column_name' raises a TypeError here,
# because column_name is a string rather than a nested list.)
suggestion_Df = pd.DataFrame(suggestionResult["constraint_suggestions"])
suggestion_Df
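Because the suggestion result is a plain dict whose `constraint_suggestions` key holds a list of flat records, it can also be regrouped with ordinary Python. The sketch below is illustrative only: `suggestion_result` is a hypothetical stand-in with invented values, and it uses just the four field names `column_name`, `description`, `rule_description`, and `code_for_constraint`.

```python
from collections import defaultdict

# Hypothetical stand-in for the runner's output; all values are invented.
suggestion_result = {
    "constraint_suggestions": [
        {
            "column_name": "b",
            "description": "example: 'b' is not null",
            "rule_description": "example rule text",
            "code_for_constraint": '.isComplete("b")',
        },
        {
            "column_name": "b",
            "description": "example: 'b' has no negative values",
            "rule_description": "example rule text",
            "code_for_constraint": '.isNonNegative("b")',
        },
        {
            "column_name": "c",
            "description": "example: 'c' is not null",
            "rule_description": "example rule text",
            "code_for_constraint": '.isComplete("c")',
        },
    ]
}

# Group the suggested constraint snippets by the column they refer to.
by_column = defaultdict(list)
for record in suggestion_result["constraint_suggestions"]:
    by_column[record["column_name"]].append(record["code_for_constraint"])

for column, snippets in sorted(by_column.items()):
    print(column, snippets)
```

Grouping by column makes it easier to copy the suggested snippets into a `Check` one column at a time.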

**Constraint verification**

In [32]:
from pyspark.sql.types import IntegerType
from pydeequ.checks import *
from pydeequ.verification import *

check = Check(spark, CheckLevel.Warning, "Review Check")

checkResult = VerificationSuite(spark) \
    .onData(df) \
    .addCheck(
        check.hasSize(lambda x: x >= 3) \
        .hasMin("b", lambda x: x == 0) \
        .hasDataType("b", ConstrainableDataTypes.Integral) \
        .isComplete("c") \
        .isComplete("b") \
        .isUnique("a") \
        .isContainedIn("a", ["foo", "bar", "baz"]) \
        .isNonNegative("b") \
        .containsEmail("c") \
        .containsURL("a")) \
    .run()

print(f"Verification Run Status: {checkResult.status}")
checkResult_df = VerificationResult.checkResultsAsDataFrame(spark, checkResult, pandas=True)
checkResult_df



Unnamed: 0,check,check_level,check_status,constraint,constraint_status,constraint_message
0,Review Check,Warning,Warning,SizeConstraint(Size(None)),Success,
1,Review Check,Warning,Warning,"MinimumConstraint(Minimum(b,None))",Failure,Value: 1.0 does not meet the constraint requir...
2,Review Check,Warning,Warning,"AnalysisBasedConstraint(DataType(b,None),<func...",Success,
3,Review Check,Warning,Warning,"CompletenessConstraint(Completeness(c,None))",Success,
4,Review Check,Warning,Warning,"CompletenessConstraint(Completeness(b,None))",Success,
5,Review Check,Warning,Warning,"UniquenessConstraint(Uniqueness(List(a),None))",Success,
6,Review Check,Warning,Warning,ComplianceConstraint(Compliance(a contained in...,Failure,Value: 0.0 does not meet the constraint requir...
7,Review Check,Warning,Warning,ComplianceConstraint(Compliance(b is non-negat...,Success,
8,Review Check,Warning,Warning,containsEmail(c),Failure,Value: 0.3333333333333333 does not meet the co...
9,Review Check,Warning,Warning,containsURL(a),Success,
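Each constraint in the table pairs a computed metric with an assertion on that metric. The plain-Python sketch below reproduces three of the rows (`hasSize`, `hasMin`, `containsEmail`) from the sample values of columns `b` and `c`. It does not call Deequ, and two parts are assumptions: the `EMAIL` regex is a simplified pattern rather than Deequ's own, and `containsEmail` is assumed to require that every row matches (value equal to 1.0) when no assertion is passed.

```python
import re

# Values of columns "b" and "c" from the three sample rows.
b_values = [1, 2, 3]
c_values = ["jobici8705@gmail", "jonatan@outlook.es", "jobici8705@"]

# Simplified email pattern (an assumption, not Deequ's actual regex):
# a local part, "@", then a domain that contains at least one dot.
EMAIL = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

# Metrics: one number per constraint.
metrics = {
    "Size": float(len(b_values)),
    "Minimum(b)": float(min(b_values)),
    "containsEmail(c)": sum(bool(EMAIL.match(v)) for v in c_values) / len(c_values),
}

# Assertions: the lambdas passed to the check, plus the assumed default.
assertions = {
    "Size": lambda x: x >= 3,
    "Minimum(b)": lambda x: x == 0,
    "containsEmail(c)": lambda x: x == 1.0,  # assumed default: all rows match
}

results = {
    name: "Success" if assertions[name](value) else "Failure"
    for name, value in metrics.items()
}
for name, status in results.items():
    print(name, metrics[name], status)
```

Under these assumptions only `jonatan@outlook.es` counts as an email, giving one match in three rows, and the minimum of `b` is 1.0 rather than 0, which is consistent with the two failures reported for those constraints.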


**Loading new data (5M records)**

In [None]:
df_ = spark.read.csv('Hr5m.csv', inferSchema=True, header=True)