##Streaming Data Quality using AWS Deequ

This notebook uses the `Deequ` package from AWS to run analysis on a streaming data source, and to derive key quality metrics about the data. Deequ is able to provide a variety of quantitative statistics and metrics about a dataset, and has utilities to generate, track, and interpret these metrics. See [this Amazon official blog](https://aws.amazon.com/blogs/big-data/test-data-quality-at-scale-with-deequ/) or check the [GitHub repo](https://github.com/awslabs/deequ/) for more details.

For this notebook, we use structured streaming, combined with Delta tables and the Deequ package, to provide a live view of a dataset's "health".

We'll use several Deequ metrics in our analysis; some of these are explained below. The full list can be found in the above links.
- `ApproxCountDistinct`: returns the approximate count of distinct values in a column
- `Distinctness`: returns the fraction of (distinct values / total values) in a column
- `Completeness`: returns the fraction of values that are non-null in a column
- `Compliance`: returns the fraction of values in a column that meet a given constraint

_Note: this notebook requires the Deequ package; add the package from Maven Central using com.amazon.deequ. For Slack notifications, [spark-slack](https://github.com/MrPowers/spark-slack) or a similar package is required._

Before we begin, we need to do some cleanup; we'll also need to download some data.

In [0]:
%fs
mkdirs /tmp/StreamingDataQuality/

In [0]:
%sh
# clear the delta checkpoint
rm -rf /dbfs/tmp/StreamingDataQuality/checkpoint

# download some generated stock tick data; this is a public Mockaroo endpoint- as such, we can't guarantee availability!
curl "https://api.mockaroo.com/api/2aedaa80?count=1000&key=8eb06b50" > /dbfs/tmp/StreamingDataQuality/stockTicks.json

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0100 12428    0 12428    0     0  12378      0 --:--:--  0:00:01 --:--:-- 12390100 94851    0 94851    0     0  46367      0 --:--:--  0:00:02 --:--:-- 46381100  149k    0  149k    0     0  60533      0 --:--:--  0:00:02 --:--:-- 60555


In [0]:
# read the raw JSON, then repartition and write into a tmp parquet folder
spark.read.json("/tmp/StreamingDataQuality/stockTicks.json").repartition(100).write.mode("overwrite").parquet("/tmp/StreamingDataQuality/source/")

In [0]:
%fs ls /tmp/StreamingDataQuality/stockTicks.json

path,name,size,modificationTime
dbfs:/tmp/StreamingDataQuality/stockTicks.json,stockTicks.json,152903,1712688780360


First we'll set up our delta tables and any necessary temporary views, as well as importing the packages to be used.

In [0]:
%scala
import spark.implicits._
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions.concat
import com.amazon.deequ.{VerificationSuite, VerificationResult}
import com.amazon.deequ.VerificationResult.checkResultsAsDataFrame
import com.amazon.deequ.checks.{Check, CheckLevel, CheckStatus}
import com.amazon.deequ.suggestions.{ConstraintSuggestionRunner, Rules}
import com.amazon.deequ.analyzers._
import com.amazon.deequ.analyzers.runners.AnalysisRunner
import com.amazon.deequ.analyzers.runners.AnalyzerContext.successMetricsAsDataFrame
import com.amazon.deequ.analyzers.{Analysis, ApproxCountDistinct, Completeness, Compliance, Distinctness, InMemoryStateProvider, Size}

In [0]:
%scala
import spark.implicits._
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions.concat
import com.amazon.deequ.{VerificationSuite, VerificationResult}
import com.amazon.deequ.VerificationResult.checkResultsAsDataFrame
import com.amazon.deequ.checks.{Check, CheckLevel, CheckStatus}
import com.amazon.deequ.suggestions.{ConstraintSuggestionRunner, Rules}
import com.amazon.deequ.analyzers._
import com.amazon.deequ.analyzers.runners.AnalysisRunner
import com.amazon.deequ.analyzers.runners.AnalyzerContext.successMetricsAsDataFrame
import com.amazon.deequ.analyzers.{Analysis, ApproxCountDistinct, Completeness, Compliance, Distinctness, InMemoryStateProvider, Size}

val data_path = "/tmp/StreamingDataQuality/source/"
val checkpoint_path = "/tmp/StreamingDataQuality/checkpoint/"
val base_df = spark.read.parquet(data_path)
val empty_df = base_df.where("0 = 1")
val l1: Long = 0

spark.sql("DROP TABLE IF EXISTS trades_delta")
spark.sql("DROP TABLE IF EXISTS bad_records")
spark.sql("DROP TABLE IF EXISTS deequ_metrics")

base_df.createOrReplaceTempView("trades_historical")
empty_df.write.format("delta").saveAsTable("trades_delta")
empty_df.withColumn("batchID",lit(l1)).write.format("delta").saveAsTable("bad_records")
dbutils.fs.mkdirs(checkpoint_path)

First, we'll take a look at the suggested quality constraints that Deequ can automatically generate. Deequ will inspect the data you give it, and generate constraints that assume future data should look similar.

In [0]:
%scala
val suggestionResult = ConstraintSuggestionRunner()
  .onData(spark.sql("SELECT * FROM trades_historical"))
  .addConstraintRules(Rules.DEFAULT)
  .run()

suggestionResult.constraintSuggestions.foreach { case (column, suggestions) =>
  suggestions.foreach { suggestion =>
    println(s"Constraint suggestion for '$column':\t${suggestion.description}\n" +
      s"The corresponding scala code is ${suggestion.codeForConstraint}\n")
  }
}

Currently, Deequ leaves it to us to decide which of these constraints to actually use. We'll choose a few to run on our full dataset. We'll also set up a few other pieces provided by Deequ to hold our stateful metrics.

In [0]:
%scala
// create a stateStore to hold our stateful metrics
val stateStoreCurr = InMemoryStateProvider()
val stateStoreNext = InMemoryStateProvider()

// create the analyzer to run on the streaming data
val analysis = Analysis()
.addAnalyzer(Size())
.addAnalyzer(ApproxCountDistinct("symbol"))
.addAnalyzer(Distinctness("symbol"))
.addAnalyzer(Completeness("ipaddr"))
.addAnalyzer(Completeness("quantity"))
.addAnalyzer(Completeness("price"))
.addAnalyzer(Compliance("top quantity", "quantity >= 0"))

Now that everything is in place, we can run the stream to populate our delta table. 

Note that before running this cell, it is preferable to run the other streaming cells below first, so that they will consume all of the records from this producer.

In [0]:
%scala
// parse the schema for the source parquet
val schema = base_df.schema

// start the stream
spark.readStream
.schema(schema)
.format("parquet")
.option("maxFilesPerTrigger",1)
.load(data_path)
.writeStream.format("delta")
.option("failOnDataLoss", false)
.option("checkpointLocation", checkpoint_path)
.format("delta").table("trades_delta")

We now need to read the delta table we just created, so that we can apply the Deequ analysis to this data. To do this, we first read the previous delta table as a stream, and then use foreachBatch to do the following:
- Set up the stateStores
- Run our analysis on the current batch
- Run a unit validation on the current batch
- If unit verification fails, add the batch to the bad records table
- Update the metrics table with the current batch

This cell writes to two tables: bad_records (which contains records from any batch that fails validation) and deequ_metrics (which contains the latest aggregated metrics from all streaming records).

In [0]:
%scala
// read the delta table and analyze
spark.readStream
.format("delta")
.table("trades_delta")
.writeStream
.foreachBatch { (batchDF: DataFrame, batchId: Long) =>
  
  // reassign our current state to the previous next state
  val stateStoreCurr = stateStoreNext
  
  // run our analysis on the current batch, aggregate with saved state
  val metricsResult = AnalysisRunner.run(
    data = batchDF,
    analysis = analysis,
    aggregateWith = Some(stateStoreCurr),
    saveStatesWith = Some(stateStoreNext))
  
  // verify critical metrics for this microbatch i.e., trade quantity, ipaddr not null, etc.
  val verificationResult = VerificationSuite()
  .onData(batchDF)
  .addCheck(
    Check(CheckLevel.Error, "unitTest")
      .hasMax("quantity", _ <= 10000) // max is 10000
      .hasCompleteness("ipaddr", _ >= 0.95) // 95%+ non-null IPs
      .isNonNegative("quantity")) // should not contain negative values
    .run()
  
  // if verification fails, write batch to bad records table
  if (verificationResult.status != CheckStatus.Success) {
    batchDF.withColumn("batchID",lit(batchId))
    .write.format("delta").mode("append").saveAsTable("bad_records")
  }
  
  // get the current metrics as a dataframe
  val metric_results = successMetricsAsDataFrame(spark, metricsResult)
  .withColumn("ts", current_timestamp())
  
  // write the current results into the metrics table
  metric_results.write.format("delta").mode("Overwrite").saveAsTable("deequ_metrics")

}
.start()

Now, we can visualize the metrics. Note that because we are updating the table, we need to set `ignoreChanges` to `true`. This means each update of the metrics will be written as a duplicate entry; we can parse this out to only take the latest view, or we can use this to create a time series view of the data quality.

In [0]:
%scala
display(spark.readStream.format("delta")
        .option("ignoreChanges", "true")
        .table("deequ_metrics")
        .where($"name" === "Size" || $"name" === "ApproxCountDistinct"))

Databricks visualization. Run in Databricks to view.

entity,instance,name,value,ts
Column,symbol,ApproxCountDistinct,100.0,2024-04-09T18:56:17.48Z
Dataset,*,Size,100.0,2024-04-09T18:56:17.48Z
Column,symbol,ApproxCountDistinct,118.0,2024-04-09T18:56:39.526Z
Dataset,*,Size,120.0,2024-04-09T18:56:39.526Z
Column,symbol,ApproxCountDistinct,153.0,2024-04-09T18:56:59.399Z
Dataset,*,Size,160.0,2024-04-09T18:56:59.399Z
Column,symbol,ApproxCountDistinct,179.0,2024-04-09T18:57:22.035Z
Dataset,*,Size,190.0,2024-04-09T18:57:22.035Z
Column,symbol,ApproxCountDistinct,220.0,2024-04-09T18:57:45.848Z
Dataset,*,Size,230.0,2024-04-09T18:57:45.848Z


In [0]:
%scala
display(spark.readStream.format("delta")
        .option("ignoreChanges", "true")
        .table("deequ_metrics")
        .where($"name" === "Completeness" || $"name" === "Distinctness"))

entity,instance,name,value,ts
Column,quantity,Completeness,0.96,2024-04-09T18:56:17.48Z
Column,symbol,Distinctness,1.0,2024-04-09T18:56:17.48Z
Column,price,Completeness,0.98,2024-04-09T18:56:17.48Z
Column,ipaddr,Completeness,0.96,2024-04-09T18:56:17.48Z
Column,quantity,Completeness,0.9583333333333334,2024-04-09T18:56:39.526Z
Column,symbol,Distinctness,0.9916666666666668,2024-04-09T18:56:39.526Z
Column,price,Completeness,0.9833333333333332,2024-04-09T18:56:39.526Z
Column,ipaddr,Completeness,0.9666666666666668,2024-04-09T18:56:39.526Z
Column,quantity,Completeness,0.95625,2024-04-09T18:56:59.399Z
Column,symbol,Distinctness,0.99375,2024-04-09T18:56:59.399Z


Databricks visualization. Run in Databricks to view.

In [0]:
%scala
display(spark.readStream.format("delta")
        .option("ignoreChanges", "true")
        .table("deequ_metrics")
        .where($"name" === "Completeness" || $"name" === "Distinctness"))

entity,instance,name,value,ts
Column,quantity,Completeness,0.96,2024-04-09T18:56:17.48Z
Column,symbol,Distinctness,1.0,2024-04-09T18:56:17.48Z
Column,price,Completeness,0.98,2024-04-09T18:56:17.48Z
Column,ipaddr,Completeness,0.96,2024-04-09T18:56:17.48Z
Column,quantity,Completeness,0.9583333333333334,2024-04-09T18:56:39.526Z
Column,symbol,Distinctness,0.9916666666666668,2024-04-09T18:56:39.526Z
Column,price,Completeness,0.9833333333333332,2024-04-09T18:56:39.526Z
Column,ipaddr,Completeness,0.9666666666666668,2024-04-09T18:56:39.526Z
Column,quantity,Completeness,0.95625,2024-04-09T18:56:59.399Z
Column,symbol,Distinctness,0.99375,2024-04-09T18:56:59.399Z


Databricks visualization. Run in Databricks to view.

In [0]:
%scala
val batchCounts = spark.read.format("delta").table("bad_records")
.groupBy($"batchId").count().withColumnRenamed("batchId", "batchId2").withColumnRenamed("count", "total")

display(spark.read.format("delta").table("bad_records")
        .filter($"quantity" < 0 || $"quantity" > 10000 || $"ipaddr" === null)
        .groupBy($"batchId").count()
        .join(batchCounts, $"batchId2" === $"batchId", "inner")
        .withColumn("percent_bad", bround(lit(100)*$"count"/$"total",3))
        .drop("batchId2").orderBy(desc("percent_bad")))

batchId,count,total,percent_bad
3,3,20,15.0
8,4,30,13.333
17,5,50,10.0
4,4,40,10.0
14,4,40,10.0
23,6,60,10.0
20,4,50,8.0
13,3,40,7.5
5,2,30,6.667
1,2,30,6.667


In [0]:
%scala
display(spark.readStream.format("delta").table("bad_records")
        .filter($"quantity" < 0 || $"quantity" > 10000 || $"ipaddr" === null))

buysell,date,ipaddr,ordertype,price,quantity,symbol,time,batchID
sell,08/13/2019,91.109.140.52,bestLimit,41.5177,-1.0,CKH,0:52:48,1
sell,08/04/2019,148.244.172.177,oco,25.0386,-1.0,RRD,18:10:28,3
buy,08/19/2019,18.221.112.226,cmo,3.96,-1.0,DOOR,18:06:32,4
sell,08/27/2019,27.230.151.23,cross,22.9554,-1.0,MRK,9:47:07,4
buy,08/22/2019,175.207.13.251,oco,34.4796,-1.0,EVHC,9:46:14,5
sell,08/24/2019,154.4.144.87,market,,-1.0,CIE,8:48:38,5
buy,08/30/2019,112.18.43.14,oco,18.995,-1.0,F,12:06:03,1
buy,08/18/2019,185.120.6.137,cross,19.4768,-1.0,PLD,10:04:09,3
sell,08/30/2019,7.92.118.255,quote,3.52,-1.0,SR,18:08:02,3
sell,08/27/2019,189.196.57.127,marketToLimit,9.958,-1.0,STAA,5:49:49,4


In [0]:
%scala
val verificationResult: VerificationResult = { VerificationSuite()
  .onData(spark.sql("select * from trades_delta"))
  .addCheck(
    Check(CheckLevel.Error, "Review Check") 
      .hasMax("quantity", _ <= 10000) // max is 10000
      .hasCompleteness("quantity", _ >= 0.95) // should never be NULL
      .isUnique("ipaddr") // should not contain duplicates
      .hasCompleteness("ipaddr", _ >= 0.95)
      .isContainedIn("buysell", Array("buy","sell")) // contains only the listed values
      .isNonNegative("quantity")) // should not contain negative values
  .run()
}

// convert check results to a Spark data frame
val resultDataFrame = checkResultsAsDataFrame(spark, verificationResult)
display(resultDataFrame)

check,check_level,check_status,constraint,constraint_status,constraint_message
Review Check,Error,Error,"MaximumConstraint(Maximum(quantity,None))",Success,
Review Check,Error,Error,"CompletenessConstraint(Completeness(quantity,None))",Success,
Review Check,Error,Error,"UniquenessConstraint(Uniqueness(List(ipaddr),None))",Success,
Review Check,Error,Error,"CompletenessConstraint(Completeness(ipaddr,None))",Success,
Review Check,Error,Error,"ComplianceConstraint(Compliance(buysell contained in buy,sell,`buysell` IS NULL OR `buysell` IN ('buy','sell'),None,List(buysell)))",Success,
Review Check,Error,Error,"ComplianceConstraint(Compliance(quantity is non-negative,COALESCE(CAST(quantity AS DECIMAL(20,10)), 0.0) >= 0,None,List(quantity)))",Failure,Value: 0.944 does not meet the constraint requirement!


In [0]:
%scala
display(resultDataFrame)

check,check_level,check_status,constraint,constraint_status,constraint_message
Review Check,Error,Error,"MaximumConstraint(Maximum(quantity,None))",Success,
Review Check,Error,Error,"CompletenessConstraint(Completeness(quantity,None))",Success,
Review Check,Error,Error,"UniquenessConstraint(Uniqueness(List(ipaddr),None))",Success,
Review Check,Error,Error,"CompletenessConstraint(Completeness(ipaddr,None))",Success,
Review Check,Error,Error,"ComplianceConstraint(Compliance(buysell contained in buy,sell,`buysell` IS NULL OR `buysell` IN ('buy','sell'),None,List(buysell)))",Success,
Review Check,Error,Error,"ComplianceConstraint(Compliance(quantity is non-negative,COALESCE(CAST(quantity AS DECIMAL(20,10)), 0.0) >= 0,None,List(quantity)))",Failure,Value: 0.944 does not meet the constraint requirement!
