# Template for data validation with Great Expectations Checkpoints

This notebook is a template that walks through the main steps for running a validation over an in-memory SparkDF with a Great Expectations `Checkpoint`.

The key steps presented in this notebook are the following:

1. Create the Spark Session and read the SparkDF
2. Set the Expectation Suite you want to use
3. Configure the Great Expectations Data Context <br/>
  3.1 Data Source configuration <br/>
  3.2 Expectation suites and Validation Results stores configuration <br/>
  3.3 Instantiate the Data Context
4. Get a batch of data and instantiate a Validator object
5. Validate your data with a `Checkpoint`

In [None]:
# import custom_expectations package
import sys
sys.path.append('../')

import custom_expectations

In [None]:
from pyspark.sql import SparkSession
import os
import datetime

from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from great_expectations.data_context.types.base import DataContextConfig
from great_expectations.exceptions import DataContextError
from great_expectations.data_context import BaseDataContext
from great_expectations.core.batch import RuntimeBatchRequest
from great_expectations.checkpoint import SimpleCheckpoint

## 1) Read data as SparkDF

Create a local SparkSession and read data from the path `../../data/` as SparkDF.

In [None]:
spark = SparkSession.builder.master('local').getOrCreate()

In [None]:
schema = StructType([
    StructField("video_id", StringType(), True),
    StructField("time_spent", IntegerType(), True),
    StructField("video_duration", IntegerType(), True),
    StructField("customer_id", StringType(), True),
    StructField("user_id", StringType(), True),
    StructField("device_id", StringType(), True),
])

In [None]:
df = spark.read.format("csv")\
    .option("sep", ",")\
    .option("nullValue", "*")\
    .option("header", "true")\
    .option("escape", "\"")\
    .schema(schema)\
    .load("../../data/sample_data.csv")
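
Because the schema is passed explicitly, Spark matches the CSV columns by position, so a renamed or reordered header may go unnoticed. Below is a minimal sketch of a pre-check on the header row, assuming the file is UTF-8 encoded and comma-separated. `EXPECTED_COLUMNS` and `header_mismatches` are hypothetical names introduced for illustration; they are not part of the template or of Spark.

```python
import csv

# Column names copied from the schema defined above.
EXPECTED_COLUMNS = [
    "video_id", "time_spent", "video_duration",
    "customer_id", "user_id", "device_id",
]


def header_mismatches(csv_path, expected=EXPECTED_COLUMNS, sep=","):
    """Return (position, expected_name, found_name) for every column that differs."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        # Read only the first row; an empty file yields an empty header.
        header = next(csv.reader(handle, delimiter=sep), [])
    mismatches = []
    for position in range(max(len(expected), len(header))):
        want = expected[position] if position < len(expected) else None
        found = header[position] if position < len(header) else None
        if want != found:
            mismatches.append((position, want, found))
    return mismatches
```

An empty list means the header agrees with the schema, for example when calling `header_mismatches("../../data/sample_data.csv")`.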

## 2) Set the Expectation Suite you want to use

Set the name of the Expectation Suite you want to use to evaluate the quality of your data.<br/>
The Expectation Suite must already have been generated and stored in the directory that you provide below in the `ExpectationsStore` configuration.

(Check the notebook [template for the creation of an Expectations Suite](../suite_dev_notebooks/expectation_suite_template.ipynb))

In [None]:
table_name = "sample_data"
suite_name = "data_quality_check"
expectation_suite_name = f"{table_name}.{suite_name}"

## 3) Configure Great Expectations Data Context

Instantiate the Great Expectations Data Context based on the official guide: [_How to instantiate a data context without a yml file_](https://docs.greatexpectations.io/docs/guides/setup/configuring_data_contexts/how_to_instantiate_a_data_context_without_a_yml_file). <br />
In the [Data Context](https://docs.greatexpectations.io/docs/terms/data_context/) we have to define all the information needed to run the validation:
- where to store the expectation suite,
- where to store the validation results,
- what engine to use (Pandas, Spark or SQLAlchemy),
- how to connect to your input data.

### 3.1) Data Source configuration

Since we are validating an [in-memory SparkDF](https://docs.greatexpectations.io/docs/guides/connecting_to_your_data/in_memory/spark), we configure the datasource as follows:
- `SparkDFExecutionEngine` as the _class_name_ of the _**execution_engine**_
- `RuntimeDataConnector` as the _class_name_ of the _**data_connector**_

In [None]:
datasources = {
    "filesystem_datasource": {
        "class_name": "Datasource",
        "module_name": "great_expectations.datasource",
        "execution_engine": {
            "module_name": "great_expectations.execution_engine",
            "class_name": "SparkDFExecutionEngine",
        },
        "data_connectors": {
            "runtime_data_connector": {
                "class_name": "RuntimeDataConnector",
                "batch_identifiers": ["batch_id"],
            },
        },
    }
}

### 3.2) Expectation suites and Validation Results stores configuration

In the Data Context we must specify where to read the Expectation Suite from and where to write the Validation Results. Both locations can be on either [Amazon S3](https://docs.greatexpectations.io/docs/guides/setup/configuring_metadata_stores/how_to_configure_an_expectation_store_in_amazon_s3) or the [local filesystem](https://docs.greatexpectations.io/docs/guides/setup/configuring_metadata_stores/how_to_configure_an_expectation_store_on_a_filesystem).
<br/>
The following cells show how to configure the Expectations and Validations stores for both scenarios.

In [None]:
# Choose where to store the expectations: S3 or Local Filesystem
expectations_s3_store = {
    "class_name": "ExpectationsStore",
    "store_backend": {
        "class_name": "TupleS3StoreBackend",
        "bucket": "bucket-name",
        "prefix": "folder/name",
     }
}
expectations_filesystem_store = {
    "class_name": "ExpectationsStore",
    "store_backend": {
        "class_name": "TupleFilesystemStoreBackend",
        "base_directory": "/home/jovyan/work/expectation_suites",
    }
}
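
Before building the context, it can help to confirm that the suite file is where the filesystem store will look for it. The sketch below assumes that the store turns each dot of the suite name into a directory level and appends `.json`. That is our understanding of the default `TupleFilesystemStoreBackend` layout and should be checked against your Great Expectations version. `expected_suite_path` is a hypothetical helper, not a library function.

```python
import os


def expected_suite_path(base_directory, expectation_suite_name, suffix=".json"):
    """Guess the file a filesystem Expectations Store uses for a suite name.

    Assumption: every dot in the name becomes a directory separator and the
    last component receives the suffix.
    """
    parts = expectation_suite_name.split(".")
    return os.path.join(base_directory, *parts) + suffix


# Values taken from the cells above.
suite_path = expected_suite_path(
    "/home/jovyan/work/expectation_suites", "sample_data.data_quality_check"
)
print(suite_path, "exists" if os.path.isfile(suite_path) else "is missing")
```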

In [None]:
# Choose where to store the validation results: S3 or Local Filesystem
validations_s3_store = {
    "class_name": "ValidationsStore",
    "store_backend": {
        "class_name": "TupleS3StoreBackend",
        "bucket": "bucket-name",
        "prefix": "folder/name",
     }
}
validations_filesystem_store = {
    "class_name": "ValidationsStore",
    "store_backend": {
        "class_name": "TupleFilesystemStoreBackend",
        "base_directory": "/home/jovyan/work/validations",
    }
}
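
The store definitions above differ only in the store class and the backend, so the choice between S3 and the local filesystem can be made in one place. A minimal sketch follows; `build_store` is a hypothetical helper (not a Great Expectations function), and the bucket, prefix and directory are the placeholder values used above.

```python
def build_store(store_class, backend, *, base_directory=None, bucket=None, prefix=None):
    """Return a store config dict shaped like the ones defined above."""
    if backend == "filesystem":
        if not base_directory:
            raise ValueError("a filesystem store needs base_directory")
        store_backend = {
            "class_name": "TupleFilesystemStoreBackend",
            "base_directory": base_directory,
        }
    elif backend == "s3":
        if not bucket:
            raise ValueError("an S3 store needs bucket")
        store_backend = {"class_name": "TupleS3StoreBackend", "bucket": bucket}
        if prefix:
            store_backend["prefix"] = prefix
    else:
        raise ValueError(f"unknown backend: {backend!r}")
    return {"class_name": store_class, "store_backend": store_backend}


# Example: expectations on the local filesystem, validation results on S3.
expectations_store = build_store(
    "ExpectationsStore", "filesystem",
    base_directory="/home/jovyan/work/expectation_suites",
)
validations_store = build_store(
    "ValidationsStore", "s3", bucket="bucket-name", prefix="folder/name",
)
```

The returned dictionaries have the same shape as the hand-written ones, so they can be placed under `stores` in the `DataContextConfig` in the same way.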

### 3.3) Instantiate the Data Context

Finally, we build the `DataContextConfig` from the previously configured datasource, Expectations Store and Validations Store, and set the path where the Data Docs website is stored (with `data_docs_sites`).

In [None]:
data_context_config = DataContextConfig(
    datasources=datasources,
    stores={
        "expectations_store": expectations_filesystem_store,
        "validations_store": validations_filesystem_store,
        "evaluation_parameter_store": {"class_name": "EvaluationParameterStore"},
    },
    expectations_store_name="expectations_store",
    validations_store_name="validations_store",
    evaluation_parameter_store_name="evaluation_parameter_store",
    checkpoint_store_name="checkpoint_store",
    data_docs_sites={
        "dq_website": {
            "class_name": "SiteBuilder",
            "store_backend": {
                "class_name": "TupleFilesystemStoreBackend",
                "base_directory": "/home/jovyan/work/site",
            },
            "site_index_builder": {
                "class_name": "DefaultSiteIndexBuilder",
                "show_cta_footer": False,
            },
        }
    },
    anonymous_usage_statistics={
      "enabled": False
    }
)

In [None]:
context = BaseDataContext(project_config=data_context_config)

## 4) Get a batch of data and instantiate a Validator object

Create a [`RuntimeBatchRequest`](https://docs.greatexpectations.io/docs/guides/connecting_to_your_data/how_to_configure_a_runtimedataconnector/#example-2-runtimedataconnector-that-uses-an-in-memory-dataframe) where you specify which Batch of data you would like to check.<br/>
**Note**: in this case the batch consists of the entire dataset previously read as a SparkDF.

In [None]:
batch_request = RuntimeBatchRequest(
    datasource_name="filesystem_datasource",
    data_connector_name="runtime_data_connector",
    data_asset_name="data_asset_name",
    batch_identifiers={"batch_id": "something_something"},
    runtime_parameters={"batch_data": df},
)
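
The names used in the `RuntimeBatchRequest` have to match the datasource configuration from section 3.1: the datasource name, the data connector name and the keys of `batch_identifiers`. Below is a minimal sketch of a pre-flight check on plain dictionaries. `batch_request_problems` is a hypothetical helper, and `datasources_excerpt` is a trimmed copy of the configuration defined earlier.

```python
# Trimmed copy of the datasource configuration from section 3.1.
datasources_excerpt = {
    "filesystem_datasource": {
        "data_connectors": {
            "runtime_data_connector": {
                "class_name": "RuntimeDataConnector",
                "batch_identifiers": ["batch_id"],
            },
        },
    }
}


def batch_request_problems(datasources, datasource_name, data_connector_name,
                           batch_identifiers):
    """Return a list of mismatches; an empty list means the names line up."""
    if datasource_name not in datasources:
        return [f"unknown datasource: {datasource_name}"]
    connectors = datasources[datasource_name].get("data_connectors", {})
    if data_connector_name not in connectors:
        return [f"unknown data connector: {data_connector_name}"]
    declared = set(connectors[data_connector_name].get("batch_identifiers", []))
    # Only keys that were never declared on the connector are reported.
    return [
        f"undeclared batch identifier: {key}"
        for key in sorted(set(batch_identifiers) - declared)
    ]


problems = batch_request_problems(
    datasources_excerpt,
    "filesystem_datasource",
    "runtime_data_connector",
    {"batch_id": "something_something"},
)
print(problems or "batch request names match the datasource configuration")
```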

In [None]:
validations = [
    {
        "batch_request": batch_request,
        "expectation_suite_name": expectation_suite_name,
    }
]

## 5) Validate your data with Great Expectations Checkpoint

We are now ready to instantiate the [Checkpoint](https://docs.greatexpectations.io/docs/guides/validation/advanced/how_to_validate_data_with_an_in_memory_checkpoint) and run a data validation.


> **Note**: Starting with Great Expectations 0.14.6, the _validations_ list (defined in the previous cell) must be passed as an input parameter to the `.run()` method. <br/>
> In older versions (<= 0.14.5), the _validations_ list was passed directly when instantiating the Checkpoint.<br/>
>- [Pull Request #4166](https://github.com/great-expectations/great_expectations/pull/4166)
>- [Great Expectations Release 0.14.6](https://github.com/great-expectations/great_expectations/releases/tag/0.14.6)

In [None]:
checkpoint = SimpleCheckpoint(
    name="checkpoint",
    data_context=context,
    class_name="SimpleCheckpoint",
    action_list=[
        {
            "name": "store_validation_result",
            "action": {
                "class_name": "StoreValidationResultAction"
            }
        }
    ]
)
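
Where the `validations` list goes depends on the installed version, as described in the note above. A minimal sketch of that decision follows, assuming plain `major.minor.patch` version strings. `parse_version` and `validations_belong_in_run` are hypothetical helpers introduced for illustration.

```python
import re


def parse_version(version):
    """Turn a string such as '0.14.6' into the tuple (0, 14, 6)."""
    numbers = []
    for field in version.split(".")[:3]:
        # Keep only the leading digits, so '6+3' or '6rc1' count as 6.
        match = re.match(r"\d+", field)
        numbers.append(int(match.group()) if match else 0)
    while len(numbers) < 3:
        numbers.append(0)
    return tuple(numbers)


def validations_belong_in_run(version):
    """True when `validations` should be passed to `.run()` (0.14.6 and later)."""
    return parse_version(version) >= (0, 14, 6)


# In the notebook the string would come from great_expectations.__version__.
for candidate in ("0.14.5", "0.14.6"):
    print(candidate, validations_belong_in_run(candidate))
```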

The Checkpoint's `.run()` method validates the batch of data against the Expectation Suite named in the `validations` list, which is loaded from the Expectations Store configured in the Data Context. <br/>
The run produces two outputs: a `.json` file with the Validation Results, written under the path declared in the `validations_*_store` variable (previously defined in the context configuration), and the Data Docs updated with the current validation output.

In [None]:
run_id = {
  "run_name": table_name+"_"+suite_name+"_run",
  "run_time": datetime.datetime.now(datetime.timezone.utc)
}

checkpoint_result = checkpoint.run(
    run_id=run_id,
    run_name_template="%Y%m%d_%H%M%S",
    validations=validations,
    action_list=[
        {
            "name": "store_validation_result",
            "action": {
                "class_name": "StoreValidationResultAction"
            }
        }
    ]
)
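
Once the run has written its `.json` file, a short summary is often easier to read than the full document. The sketch below works on the result as a plain dictionary and assumes the usual layout of a stored validation result: a top-level `success` flag and a `results` list whose items carry `success` and `expectation_config.expectation_type`. These key names should be verified against a file produced by your own run. `summarize_validation` and `summarize_validation_file` are hypothetical helpers, and `sample_result` is hand-written, not real output.

```python
import json


def summarize_validation(result):
    """Condense a validation result dict into an overall flag and the failures."""
    results = result.get("results", [])
    failed = [
        item.get("expectation_config", {}).get("expectation_type", "<unknown>")
        for item in results
        if not item.get("success", False)
    ]
    return {
        "success": bool(result.get("success", False)),
        "evaluated": len(results),
        "failed_expectations": failed,
    }


def summarize_validation_file(path):
    """Load a stored validation result from disk and summarize it."""
    with open(path, encoding="utf-8") as handle:
        return summarize_validation(json.load(handle))


# Hand-written sample in the assumed layout.
sample_result = {
    "success": False,
    "results": [
        {"success": True,
         "expectation_config": {"expectation_type": "expect_column_to_exist"}},
        {"success": False,
         "expectation_config": {"expectation_type": "expect_column_values_to_not_be_null"}},
    ],
}
summary = summarize_validation(sample_result)
print(summary)
```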

If you only want to update the Data Docs to check how the latest Expectation you added to the Expectation Suite is rendered, see the [python script](../generate_data_doc) dedicated to creating and updating the Great Expectations Data Docs.

#### Bonus: Validate your data with Great Expectations <= 0.14.5

Here is how to instantiate a Checkpoint and run a data validation with Great Expectations <= 0.14.5 (before [Pull Request #4166](https://github.com/great-expectations/great_expectations/pull/4166) was merged).

```
# Instantiate the Checkpoint
checkpoint = SimpleCheckpoint(
    name="checkpoint",
    data_context=context,
    class_name="SimpleCheckpoint",
    validations=validations,
    action_list=[
        {
            "name": "store_validation_result",
            "action": {
                "class_name": "StoreValidationResultAction"
            }
        }
    ]
)

# Set run ID
run_id = {
  "run_name": table_name+"_"+suite_name+"_run",
  "run_time": datetime.datetime.now(datetime.timezone.utc)
}

# run a data Validation
checkpoint_result = checkpoint.run(
    run_id=run_id,
    run_name_template="%Y%m%d_%H%M%S",
    action_list=[
        {
            "name": "store_validation_result",
            "action": {
                "class_name": "StoreValidationResultAction"
            }
        }
    ]
)
```