# KLL Checks Example

Here is a basic example of running a `Check` with a [KLL sketch](https://arxiv.org/abs/1603.05346). A sketch represents a larger dataset as a compact, binned statistical summary. KLL sketches are useful because they are very compact quantile sketches whose lazy compaction scheme still keeps them highly accurate. They are also designed to work on streams of data, where the dataset may still be growing, and not only on a static, fully known dataset.
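
To make the idea of compaction more concrete, here is a toy, pure-Python quantile sketch in the spirit of KLL. It is only a sketch under assumptions: the class name `ToyKLL`, the per-level capacity rule, and the eager rescan after each compaction are illustrative choices, not Deequ's implementation. Each level `h` holds items of weight `2**h`; when a level overflows, it is sorted and every other item is promoted to the next level.

```python
import math
import random


class ToyKLL:
    """Toy KLL-style quantile sketch (illustrative only, not Deequ's code)."""

    def __init__(self, sketch_size=200, shrinking_factor=0.64, seed=0):
        self.k = sketch_size
        self.c = shrinking_factor
        self.levels = [[]]              # levels[h] holds items of weight 2**h
        self.rng = random.Random(seed)

    def _capacity(self, h):
        # The top level gets the full sketch size; lower levels shrink geometrically.
        depth = len(self.levels) - 1 - h
        return max(2, math.ceil(self.k * self.c ** depth))

    def update(self, x):
        self.levels[0].append(x)
        h = 0
        while h < len(self.levels):
            if len(self.levels[h]) > self._capacity(h):
                self._compact(h)
                h = 0                   # capacities may have changed, so rescan
            else:
                h += 1

    def _compact(self, h):
        if h + 1 == len(self.levels):
            self.levels.append([])
        buf = sorted(self.levels[h])
        # Keep one item behind when the count is odd so total weight is preserved.
        leftover = [buf.pop()] if len(buf) % 2 else []
        offset = self.rng.randint(0, 1)  # random choice of odd or even positions
        self.levels[h + 1].extend(buf[offset::2])
        self.levels[h] = leftover

    def stored(self):
        return sum(len(lvl) for lvl in self.levels)

    def total_weight(self):
        return sum(len(lvl) * 2 ** h for h, lvl in enumerate(self.levels))

    def quantile(self, q):
        # Walk the stored items in sorted order, accumulating their weights.
        weighted = sorted((x, 2 ** h) for h, lvl in enumerate(self.levels) for x in lvl)
        target = q * self.total_weight()
        running = 0
        for x, w in weighted:
            running += w
            if running >= target:
                return x
        return weighted[-1][0]


values = list(range(10_000))
random.Random(1).shuffle(values)

sketch = ToyKLL(sketch_size=200, shrinking_factor=0.64)
for v in values:
    sketch.update(v)

print(sketch.stored(), "items kept for", sketch.total_weight(), "values")
print("approximate median:", sketch.quantile(0.5))
```

The sketch keeps only a few hundred items for 10,000 inputs, yet the weighted items still add up to the full count, which is what lets it answer quantile queries approximately.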

We'll start by creating a Spark session and a small sample dataframe.

In [18]:
import pydeequ

import sagemaker_pyspark
from pyspark.sql import SparkSession, Row

classpath = ":".join(sagemaker_pyspark.classpath_jars()) # aws-specific jars

spark = (SparkSession
    .builder
    .config("spark.driver.extraClassPath", classpath)
    .config("spark.jars.packages", pydeequ.deequ_maven_coord)
    .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
    .getOrCreate())

In [19]:
df = spark.sparkContext.parallelize([
    Row(idx=1, name="Thingy A", description="awesome thing.", rating="high", numViews=0),
    Row(idx=2, name="Thingy B", description="available at http://thingb.com", rating=None, numViews=0),
    Row(idx=3, name=None, description=None, rating="low", numViews=5),
    Row(idx=4, name="Thingy D", description="checkout https://thingd.ca", rating="low", numViews=10),
    Row(idx=5, name="Thingy E", description=None, rating="high", numViews=12)]).toDF()

## Let's construct the Verification runner for our KLL check! 

`hasSize` and `hasMax` are basic `Checks` that we've already covered in the [basic_example notebook](./basic_example.ipynb), so let's focus our attention on the `kllSketchSatisfies` constraint. Here, we are checking that our KLL sketch size is at least 3.

**However**, we're purposely passing a `KLLParameters` object into the constraint that sets our sketch size to 2, so we should expect this check to fail! Note that `hasMax` will fail too, since the largest `numViews` value in our dataframe is 12.

```KLLParameters(spark_session, sketch_size, shrinking_factor, n_buckets)```

Below, you can see the two values you can assert on with the `kllSketchSatisfies` check: the shrinking factor and the sketch size. Calling `apply(0)` returns the shrinking factor, and calling `apply(1)` returns the sketch size.

```
x.parameters().apply(0) => shrinking_factor 
x.parameters().apply(1) => sketch_size 
```
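
To try out assertion lambdas without starting Spark, the indexing above can be mimicked with a small stand-in object. The classes `FakeParameters` and `FakeBucketDistribution` below are hypothetical and exist only for this illustration; they are not part of PyDeequ. They simply expose `parameters().apply(i)` in the same order as described above.

```python
class FakeParameters:
    """Hypothetical stand-in for the parameter list on a KLL result."""

    def __init__(self, values):
        self._values = list(values)

    def apply(self, index):
        return self._values[index]


class FakeBucketDistribution:
    """Hypothetical stand-in for the object passed to the assertion."""

    def __init__(self, shrinking_factor, sketch_size):
        # Same order as above: index 0 = shrinking factor, index 1 = sketch size.
        self._parameters = FakeParameters([shrinking_factor, float(sketch_size)])

    def parameters(self):
        return self._parameters


# Assertions written the same way they would be passed to kllSketchSatisfies.
sketch_size_at_least_3 = lambda x: x.parameters().apply(1) >= 3
shrinking_factor_is_064 = lambda x: x.parameters().apply(0) == 0.64

small = FakeBucketDistribution(shrinking_factor=0.64, sketch_size=2)
print(sketch_size_at_least_3(small))   # False, because the sketch size is 2
print(shrinking_factor_is_064(small))  # True
```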

In [30]:
from pydeequ.checks import *
from pydeequ.analyzers import KLLParameters
from pydeequ.verification import *

check = Check(spark, CheckLevel.Error, "KLL Checks")

result = VerificationSuite(spark) \
    .onData(df) \
    .addCheck(
        check.hasSize(lambda x: x == 5) \
        .hasMax("numViews", lambda x: x <= 10) \
        .kllSketchSatisfies("numViews", lambda x: x.parameters().apply(1) >= 3, 
                           KLLParameters(spark, 2, 0.64, 2))) \
    .run()

In [31]:
if result.status == "Success": 
    print('The data passed the test, everything is fine!')

else:
    print('We found errors in the data, the following constraints were not satisfied:')
    
    for check_json in result.checkResults:
        if check_json['constraint_status'] != "Success": 
            print(f"\t{check_json['constraint']} failed because: {check_json['constraint_message']}")

We found errors in the data, the following constraints were not satisfied:
	MaximumConstraint(Maximum(numViews,None)) failed because: Value: 12.0 does not meet the constraint requirement!
	kllSketchConstraint(KLLSketch(numViews,Some(KLLParameters(2,0.64,2)))) failed because: Value: BucketDistribution(List(BucketValue(0.0,6.0,3), BucketValue(6.0,12.0,2)),List(0.64, 2.0),[[D@4e4e48bd) does not meet the constraint requirement!
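
The `BucketValue(low, high, count)` entries in the failure message can be checked by hand. The snippet below is a rough sketch that bins the raw `numViews` values into equal-width buckets; the helper name `equal_width_buckets` is made up for this illustration. Deequ derives its buckets from the sketch rather than from the raw values, so treat this only as a sanity check that happens to agree on such a tiny dataset.

```python
def equal_width_buckets(values, n_buckets):
    """Split [min, max] into n_buckets equal-width bins and count the values.

    Assumes max(values) > min(values); the maximum falls into the last bucket.
    """
    lo, hi = float(min(values)), float(max(values))
    width = (hi - lo) / n_buckets
    counts = [0] * n_buckets
    for v in values:
        idx = min(int((v - lo) / width), n_buckets - 1)
        counts[idx] += 1
    return [(lo + i * width, lo + (i + 1) * width, counts[i])
            for i in range(n_buckets)]


num_views = [0, 0, 5, 10, 12]      # the numViews column from the dataframe above
buckets = equal_width_buckets(num_views, 2)
print(buckets)  # [(0.0, 6.0, 3), (6.0, 12.0, 2)]
```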


## Let's run it again without specifying the sketch size... 

When our datasets get bigger, we won't want to limit our sketch to 2 buckets. What happens if we just run with the default values for KLL Sketches? 

Let's deliberately assert that the `shrinking_factor` is at most 0, which will fail the check, so that we can see its output!

In [34]:
check = Check(spark, CheckLevel.Error, "KLL Checks")

result = VerificationSuite(spark) \
    .onData(df) \
    .addCheck(
        check.hasSize(lambda x: x == 5) \
        .hasMax("numViews", lambda x: x <= 10) \
        .kllSketchSatisfies("numViews", lambda x: x.parameters().apply(0) <= 0)) \
    .run()

In [35]:
if result.status == "Success": 
    print('The data passed the test, everything is fine!')

else:
    print('We found errors in the data, the following constraints were not satisfied:')
    
    for check_json in result.checkResults:
        if check_json['constraint_status'] != "Success": 
            print(f"\t{check_json['constraint']} failed because: {check_json['constraint_message']}")

We found errors in the data, the following constraints were not satisfied:
	MaximumConstraint(Maximum(numViews,None)) failed because: Value: 12.0 does not meet the constraint requirement!
	kllSketchConstraint(KLLSketch(numViews,None)) failed because: Value: BucketDistribution(List(BucketValue(0.0,0.12,2), BucketValue(0.12,0.24,0), BucketValue(0.24,0.36,0), BucketValue(0.36,0.48,0), BucketValue(0.48,0.6,0), BucketValue(0.6,0.72,0), BucketValue(0.72,0.84,0), BucketValue(0.84,0.96,0), BucketValue(0.96,1.08,0), BucketValue(1.08,1.2,0), BucketValue(1.2,1.32,0), BucketValue(1.32,1.44,0), BucketValue(1.44,1.56,0), BucketValue(1.56,1.68,0), BucketValue(1.68,1.8,0), BucketValue(1.8,1.92,0), BucketValue(1.92,2.04,0), BucketValue(2.04,2.16,0), BucketValue(2.16,2.28,0), BucketValue(2.28,2.4,0), BucketValue(2.4,2.52,0), BucketValue(2.52,2.64,0), BucketValue(2.64,2.76,0), BucketValue(2.76,2.88,0), BucketValue(2.88,3.0,0), BucketValue(3.0,3.12,0), BucketValue(3.12,3.24,0), BucketValue(3.24,3.36,0), B

## Now we've uncovered PyDeequ's default parameters for KLL Sketches!

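The output above (truncated here) displays the 100 buckets that the KLL sketch produced, each 0.12 wide so that together they span `numViews` from 0 to 12. The sketch was built with the following default sketch size and shrinking factor.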

```
DEFAULT_SKETCH_SIZE = 2048
DEFAULT_SHRINKING_FACTOR = 0.64
MAXIMUM_ALLOWED_DETAIL_BINS = 100
```