# Template for the creation of an Expectation Suite

This notebook is a template that shows the main steps to follow in order to create an Expectation Suite from a sample dataset.

The key steps presented in this notebook are the following:

1. Create the Spark Session and read the SparkDF
2. Define the Expectation Suite name
3. Configure the Great Expectations Data Context <br/>
  3.1 Data Source configuration <br/>
  3.2 Expectation suites and Validation Results stores configuration <br/>
  3.3 Instantiate the Data Context
4. Get a batch of data and instantiate a Validator object
5. Add Expectations to the Expectation Suite <br/>
  5.1 Add Custom Expectations <br/>
  5.2 Add native (built-in) Expectations
6. Save the Expectation Suite.

In [None]:
# import custom_expectations package
import sys
sys.path.append('../')

import custom_expectations

In [None]:
from pyspark.sql import SparkSession
import os

from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from great_expectations.data_context.types.base import DataContextConfig
from great_expectations.exceptions import DataContextError
from great_expectations.data_context import BaseDataContext
from great_expectations.core.batch import RuntimeBatchRequest

## 1) Read data as SparkDF

Create a local SparkSession and read data from the path `../../data/` as SparkDF.

In [None]:
spark = SparkSession.builder.master('local').getOrCreate()
sc = spark.sparkContext

In [None]:
schema = StructType([
    StructField("video_id",StringType(),True),
    StructField("time_spent", IntegerType(), True),
    StructField("video_duration", IntegerType(), True),
    StructField("customer_id", StringType(), True),
    StructField("user_id", StringType(), True),
    StructField("device_id", StringType(), True)
  ])

In [None]:
df = spark.read.format("csv")\
    .option("sep", ",")\
    .option("nullValue", "*")\
    .option("header", "true")\
    .option("escape", "\"")\
    .schema(schema)\
    .load("../../data/sample_data.csv")

## 2) Define the Expectation Suite name

Define the name of the Expectation Suite to develop. Since multiple Expectation Suites can exist for the same batch of data, we name them as `table_name + "." + suite_name`.

In [None]:
table_name = "sample_data"
suite_name = "data_quality_check"
expectation_suite_name = table_name+"."+suite_name
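The naming convention above can be wrapped in small helpers so that every notebook builds suite names the same way. This is only a sketch: `build_suite_name` and `split_suite_name` are hypothetical names, not part of Great Expectations, and the split assumes that table names contain no dots.

```python
# Hypothetical helpers (not part of Great Expectations) for the
# `table_name + "." + suite_name` convention used in this notebook.

def build_suite_name(table_name: str, suite_name: str) -> str:
    """Join table and suite name with a dot, rejecting empty parts."""
    if not table_name or not suite_name:
        raise ValueError("table_name and suite_name must be non-empty")
    return f"{table_name}.{suite_name}"


def split_suite_name(expectation_suite_name: str) -> tuple:
    """Split on the first dot (assumes the table name itself has no dots)."""
    table_name, _, suite_name = expectation_suite_name.partition(".")
    return table_name, suite_name


print(build_suite_name("sample_data", "data_quality_check"))
print(split_suite_name("sample_data.data_quality_check"))
```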

## 3) Configure Great Expectations Data Context

Instantiate the Great Expectations Data Context based on the official guide: [_How to instantiate a data context without a yml file_](https://docs.greatexpectations.io/docs/guides/setup/configuring_data_contexts/how_to_instantiate_a_data_context_without_a_yml_file). <br />
In the [Data Context](https://docs.greatexpectations.io/docs/terms/data_context/) we have to define all the necessary information to create our expectation suite: 
- where to store the expectation suite,
- where to store the validation results,
- what engine to use (Pandas, Spark or SQLAlchemy),
- how to connect to your input data.

### 3.1) Data Source configuration

Since we are developing an [expectation suite over an in-memory SparkDF](https://docs.greatexpectations.io/docs/guides/connecting_to_your_data/in_memory/spark), we configure the datasource as follows:
- `SparkDFExecutionEngine` as the _class_name_ of the _**execution_engine**_
- `RuntimeDataConnector` as the _class_name_ of the _**data connector**_

In [None]:
datasources={
    "filesystem_datasource": {
        "class_name": "Datasource",
        "module_name": "great_expectations.datasource",
        'execution_engine': {
            'module_name': 'great_expectations.execution_engine',
            'class_name': 'SparkDFExecutionEngine'
        },
        "data_connectors": {
            "runtime_data_connector": {
                "class_name": "RuntimeDataConnector",
                "batch_identifiers": ["batch_id"],
            },
        },
    }
}

### 3.2) Expectation suites and Validation Results stores configuration

In the Data Context we must specify the paths where the Expectation Suite and the validation results are stored. Paths can point to either [Amazon S3](https://docs.greatexpectations.io/docs/guides/setup/configuring_metadata_stores/how_to_configure_an_expectation_store_in_amazon_s3) or the [Local Filesystem](https://docs.greatexpectations.io/docs/guides/setup/configuring_metadata_stores/how_to_configure_an_expectation_store_on_a_filesystem).
<br/>
In the following cells we show how to configure the expectations and validations stores for those two scenarios.

In [None]:
# Choose where to store the expectations: S3 or Local Filesystem
expectations_s3_store = {
    "class_name": "ExpectationsStore",
    "store_backend": {
        "class_name": "TupleS3StoreBackend",
        "bucket": "bucket-name",
        "prefix": "folder/name",
     }
}
expectations_filesystem_store = {
    "class_name": "ExpectationsStore",
    "store_backend": {
        "class_name": "TupleFilesystemStoreBackend",
        "base_directory": "/home/jovyan/work/expectation_suites",
    }
}

In [None]:
# Choose where to store the validations: S3 or Local Filesystem
validations_s3_store = {
    "class_name": "ValidationsStore",
    "store_backend": {
        "class_name": "TupleS3StoreBackend",
        "bucket": "bucket-name",
        "prefix": "folder/name",
     }
}
validations_filesystem_store = {
    "class_name": "ValidationsStore",
    "store_backend": {
        "class_name": "TupleFilesystemStoreBackend",
        "base_directory": "/home/jovyan/work/validations",
    }
}
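The four dictionaries above differ only in the store class and the backend options, so the choice between S3 and the local filesystem could be reduced to a single argument. The following is a minimal sketch under that assumption; `make_store` is a hypothetical helper, not a Great Expectations API, and it only builds the same plain dictionaries shown above.

```python
# Hypothetical helper: builds the same store dictionaries as the cells above.
BACKEND_CLASSES = {
    "s3": "TupleS3StoreBackend",
    "filesystem": "TupleFilesystemStoreBackend",
}


def make_store(store_class_name, backend, **backend_options):
    """Return a store config dict for the given backend ('s3' or 'filesystem')."""
    if backend not in BACKEND_CLASSES:
        raise ValueError(f"Unknown backend: {backend!r}")
    return {
        "class_name": store_class_name,
        "store_backend": {
            "class_name": BACKEND_CLASSES[backend],
            **backend_options,  # e.g. bucket/prefix or base_directory
        },
    }


expectations_store = make_store(
    "ExpectationsStore",
    "filesystem",
    base_directory="/home/jovyan/work/expectation_suites",
)
validations_store = make_store(
    "ValidationsStore", "s3", bucket="bucket-name", prefix="folder/name"
)
print(expectations_store)
print(validations_store)
```

The resulting dictionaries can be passed in the `stores` argument of `DataContextConfig` exactly like the hand-written ones.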

### 3.3) Instantiate the Data Context

Finally we instantiate the `DataContextConfig` specifying the previously configured datasource, expectations store and validations store and setting the path where to store the Data Docs website (with `data_docs_sites`).

In [None]:
data_context_config = DataContextConfig(
    datasources=datasources,
    stores={
        "expectations_store": expectations_filesystem_store,
        "validations_store": validations_filesystem_store,
        "evaluation_parameter_store": {"class_name": "EvaluationParameterStore"},
    },
    expectations_store_name="expectations_store",
    validations_store_name="validations_store",
    evaluation_parameter_store_name="evaluation_parameter_store",
    data_docs_sites={
        "dq_website": {
            "class_name": "SiteBuilder",
            "store_backend": {
                "class_name": "TupleFilesystemStoreBackend",
                "base_directory": "/home/jovyan/work/site",
            },
            "site_index_builder": {
                "class_name": "DefaultSiteIndexBuilder",
                "show_cta_footer": False,
            },
        }
    },
    anonymous_usage_statistics={
      "enabled": False
    }
)

In [None]:
context = BaseDataContext(project_config=data_context_config)

Once the `BaseDataContext` is instantiated, run the cell below to create a new empty Expectation Suite. With `overwrite_existing=True`, an existing suite with the same name is replaced; set it to `False` to keep the existing suite.

In [None]:
try:
    suite = context.create_expectation_suite(
        expectation_suite_name,
        overwrite_existing=True  # Set to False to keep an existing suite
    )
except DataContextError:
    # Raised only when overwrite_existing=False and the suite already exists
    suite = context.get_expectation_suite(expectation_suite_name)
    print("'{}' already exists and has been loaded for editing.".format(expectation_suite_name))
else:
    print("'{}' suite has been created (replacing any existing suite with this name).".format(expectation_suite_name))

## 4) Get a batch of data and instantiate a Validator object

Create a [`RuntimeBatchRequest`](https://docs.greatexpectations.io/docs/guides/connecting_to_your_data/how_to_configure_a_runtimedataconnector/#example-2-runtimedataconnector-that-uses-an-in-memory-dataframe) where you specify which Batch of data you would like to check.<br/>
**Note**: in this case the batch consists of the entire dataset previously read as a SparkDF.

In [None]:
batch_request = RuntimeBatchRequest(
    datasource_name="filesystem_datasource",
    data_connector_name="runtime_data_connector",
    data_asset_name="data_asset_name",
    batch_identifiers={"batch_id": "something_something"},
    runtime_parameters={"batch_data": df},
)
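The `datasource_name`, `data_connector_name` and the keys of `batch_identifiers` in the request must match the names declared in the datasource configuration of section 3.1. The sketch below checks this with plain dictionaries before any Great Expectations call is made; `check_request_matches_config` is a hypothetical helper, and the configuration is repeated here only to keep the example self-contained.

```python
# Reduced copy of the datasource configuration from section 3.1.
datasources = {
    "filesystem_datasource": {
        "data_connectors": {
            "runtime_data_connector": {
                "class_name": "RuntimeDataConnector",
                "batch_identifiers": ["batch_id"],
            },
        },
    }
}

# The same names used when building the RuntimeBatchRequest.
request_args = {
    "datasource_name": "filesystem_datasource",
    "data_connector_name": "runtime_data_connector",
    "batch_identifiers": {"batch_id": "something_something"},
}


def check_request_matches_config(datasources, request_args):
    """Raise KeyError if the request refers to names missing from the config."""
    datasource = datasources[request_args["datasource_name"]]
    connector = datasource["data_connectors"][request_args["data_connector_name"]]
    declared = set(connector.get("batch_identifiers", []))
    unknown = set(request_args["batch_identifiers"]) - declared
    if unknown:
        raise KeyError(f"Undeclared batch identifiers: {sorted(unknown)}")


check_request_matches_config(datasources, request_args)
print("Batch request names are consistent with the datasource configuration.")
```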

Once done, instantiate a [`Validator`](https://docs.greatexpectations.io/docs/guides/expectations/how_to_create_and_edit_expectations_with_a_profiler#3-instantiate-your-validator) object 
to access and interact with your data and to start to work with your Expectation Suite.

In [None]:
validator = context.get_validator(
    batch_request=batch_request, 
    expectation_suite_name=expectation_suite_name
)

## 5) Add Expectations to the Expectation Suite

Great Expectations provides two kinds of Expectations to test your data:

- native expectations 
- custom expectations

Below we show how to define both types of expectations and how to add them to the Expectations Suite through the Validator object.

### 5.1) Add Custom Expectations to the Expectation Suite

[Custom Expectations](https://docs.greatexpectations.io/docs/guides/expectations/creating_custom_expectations/overview) are user-defined expectations that are not present in the [Expectation Gallery](https://greatexpectations.io/expectations/) and that allow you to create fully custom checks over your batch of data.<br/>
These expectations are importable Python modules that contain the logic of the check and the information needed to render the expectation correctly in the generated Data Docs.

Currently we are storing all the Custom Expectations in the relative path `../custom_expectations`. In order to call the Custom Expectations during the Expectation Suite creation, you must import them in your notebook kernel session (see the first cell of this notebook).

In this case we are going to show three different types of Custom Expectations:
* **Single Column Expectations**: custom checks over a single column of the table.
* **Pair Columns Expectations**: custom checks over a pair of table columns.
* **Multi Columns Expectations**: custom checks over a subset of table columns.

### Single column expectation with `video_id` column

In [None]:
validator.expect_column_length_match_input_length(
    column='video_id', 
    length=11
)

### Pair column expectation with `time_spent` and `video_duration` columns

In [None]:
validator.expect_column_pair_a_to_be_approximately_smaller_or_equal_than_b(
    column_A='time_spent',
    column_B='video_duration',
    n_approximate=1
)

### Multicolumn expectation with `customer_id`, `user_id` and `device_id` columns

In [None]:
validator.expect_multicolumn_customer_id_user_id_device_id(
    column_list=['customer_id', 'user_id', 'device_id'],
    device_id_regex='d[0-9]{3}$'
)

### 5.2) Add native (built-in) Expectations to the Expectation Suite

The native expectations are the built-in checks listed in the [Expectations gallery](https://greatexpectations.io/expectations/). As shown below, you can add an expectation to your validator object by calling the corresponding method.

### video_id

In [None]:
column_name = 'video_id'

In [None]:
validator.expect_column_values_to_be_in_type_list(column_name, ['StringType'])

In [None]:
validator.expect_column_values_to_match_regex(column_name, regex='V[0-9]')
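The pattern `V[0-9]` is not anchored. Assuming the execution engine applies the regex with search semantics (to our understanding, Spark's `rlike` matches anywhere in the string), any value that merely contains `V` followed by a digit passes. The sketch below uses Python's `re.search` as a stand-in for that behaviour; the anchored pattern `^V[0-9]+$` is a hypothetical stricter alternative, not the format of the real `video_id` values.

```python
import re

unanchored = re.compile(r"V[0-9]")   # pattern used in the cell above
anchored = re.compile(r"^V[0-9]+$")  # hypothetical stricter pattern

samples = ["V1234567890", "xxV1yy", "video", "V12a"]

# For each sample, record whether each pattern finds a match.
results = {
    value: (bool(unanchored.search(value)), bool(anchored.search(value)))
    for value in samples
}

for value, (loose, strict) in results.items():
    print(f"{value!r}: unanchored={loose}, anchored={strict}")
```

If the column should contain nothing but the identifier, anchoring the pattern avoids accepting values such as `xxV1yy`.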

In [None]:
validator.expect_column_values_to_not_be_null(column_name)

### time_spent

In [None]:
column_name = "time_spent"

In [None]:
validator.expect_column_values_to_be_in_type_list(column_name, ['IntegerType'])

In [None]:
validator.expect_column_min_to_be_between(column_name, min_value=0)

In [None]:
validator.expect_column_values_to_not_be_null(column_name)

### video_duration

In [None]:
column_name = 'video_duration'

In [None]:
validator.expect_column_values_to_be_in_type_list(column_name, ['IntegerType'])

In [None]:
validator.expect_column_values_to_be_between(column_name, min_value = 0, max_value = 3600)

In [None]:
validator.expect_column_values_to_not_be_null(column_name)

### customer_id

In [None]:
column_name = 'customer_id'

In [None]:
validator.expect_column_values_to_be_in_type_list(column_name, ["StringType"])

In [None]:
validator.expect_column_values_to_not_be_null(column_name)

### user_id

In [None]:
column_name = 'user_id'

In [None]:
validator.expect_column_values_to_be_in_type_list(column_name, ['StringType'])

In [None]:
validator.expect_column_values_to_match_regex(column_name, '[0-9]{4}$')

In [None]:
validator.expect_column_values_to_not_be_null(column_name)

### device_id

In [None]:
column_name = 'device_id'

In [None]:
validator.expect_column_values_to_be_in_type_list(column_name, ["StringType"])

In [None]:
validator.expect_column_values_to_match_regex(column_name, 'd[0-9]{3}$')

In [None]:
validator.expect_column_values_to_not_be_null(column_name)

## 6) Save the Expectation Suite

Persist the Expectation Suite to the path defined in the Data Context by running the `.save_expectation_suite()` method.

In [None]:
validator.save_expectation_suite(discard_failed_expectations=False)
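To find the saved file, it helps to know how the suite name maps to a path. To our understanding, the `TupleFilesystemStoreBackend` splits the suite name on dots and writes a JSON file under `base_directory`, but treat this layout as an assumption and verify it against your own store. `expected_suite_path` is a hypothetical helper that only computes the path and does not touch the filesystem.

```python
from pathlib import PurePosixPath


def expected_suite_path(base_directory, expectation_suite_name):
    """Compute where the suite JSON is assumed to be written (no I/O)."""
    parts = expectation_suite_name.split(".")
    # Leading parts become directories; the last part becomes the file name.
    return PurePosixPath(base_directory).joinpath(*parts[:-1], parts[-1] + ".json")


suite_path = expected_suite_path(
    "/home/jovyan/work/expectation_suites",
    "sample_data.data_quality_check",
)
print(suite_path)
```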

#### To be continued...

Now you can run a data validation using the Expectation Suite you just created.<br/>
Check the notebook [template for data validation with Great Expectations Checkpoints](../validate_data/data_validation_with_checkpoints_template.ipynb).