# Data validation example notebook

In this tutorial notebook, we show how to implement and configure Great Expectations validation to run with your Kedro pipeline. We cover the following:

- Overview of `kedro-great-expectations` integration
- How to create and edit an expectation suite for a dataset
- How to add two types of expectations (table level and column level)
- How to save expectations to our validation suite
- A review of the results in HTML

# Example of a GE validation notebook
Use this notebook to recreate and modify your expectation suite:

**Expectation Suite Name**: `dataset_name` <br>
*For this tutorial we use `in_out_recent` as our dataset.*

In [1]:
%reload_kedro

2020-08-17 13:35:14,111 - root - INFO - ** Kedro project optimus_pkg
2020-08-17 13:35:14,112 - root - INFO - Defined global variable `context` and `catalog`
2020-08-17 13:35:14,117 - root - INFO - Registered line magic `run_viz`


### Read dataset and assign batch

In [2]:
from datetime import datetime

import great_expectations.jupyter_ux
from great_expectations.data_context.types.resource_identifiers import (
    ValidationResultIdentifier,
)
from kedro_great_expectations.config import KedroGEConfig
from kedro_great_expectations import ge_context as ge

kedro_ge_config = KedroGEConfig.for_interactive_mode(context)

data_context = ge.get_ge_context()

expectation_suite_name = "in_out_recent"
dataset_name = "in_out_recent"
suite = data_context.get_expectation_suite(expectation_suite_name)
suite.expectations = []

# Use kedro to load the dataset:
batch_kwargs = ge.get_batch_kwargs(
    data=catalog.load(dataset_name), ds_name=dataset_name, ge_context=data_context
)
batch = data_context.get_batch(batch_kwargs, suite.expectation_suite_name)
batch.head(5)

2020-08-17T13:35:16-0500 - INFO - Great Expectations logging enabled at 20 level by JupyterUX module.
2020-08-17 13:35:16,153 - great_expectations - INFO - Great Expectations logging enabled at 20 level by JupyterUX module.
2020-08-17 13:35:16,461 - kedro.io.data_catalog - INFO - Loading data from `in_out_recent` (CSVDataSet)...



| Unnamed: 0 | status_time | inp_quantity | cu_content | outp_quantity | inp_avg_hardness |
|---|---|---|---|---|---|
| 0 | 2020-07-27 03:59:58 | 147.0 | 0.079816 | 159 | 0.478026 |
| 1 | 2020-07-27 04:14:59 | 230.0 | 0.079816 | 238 | 0.503276 |
| 2 | 2020-07-27 04:29:56 | 251.0 | 0.079816 | 246 | 0.506612 |
| 3 | 2020-07-27 04:45:04 | 250.0 | 0.079816 | 251 | 0.518337 |
| 4 | 2020-07-27 05:00:00 | 240.0 | 0.079683 | 242 | 0.489977 |


## Clear all expectations

If this is the first time you're editing this expectation suite and you've autogenerated the expectations, you may wish to clear all and add the expectations selectively.

In that case, run the code cell below and execute the cells containing the expectations you wish to keep before saving the suite. You can delete the cells for the expectations you don't wish to keep, but they will be removed automatically the next time you run `kedro ge edit in_out_recent` anyway.


In [3]:
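# Start from an empty suite so that only the expectations executed below are saved.
batch._expectation_suite.expectations = []
# Helper functions used in the cells below.
from core_pipelines.kedro_utils.great_expectations.great_expectations_utils import *
# Pipeline parameters from the Kedro context, passed to most helpers below.
params = context.params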

### Table Expectation(s)

#### Validate if sensors are part of the dataframe

In [4]:
create_sensor_exist_expectation(batch, params)
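
The implementation of `create_sensor_exist_expectation` is not shown in this notebook. As a rough sketch of what a table-level existence check amounts to, the plain-Python function below reports which expected sensor columns are absent. The name `missing_columns` and the `expected_sensors` list are hypothetical; the column names are taken from the sample output above. Great Expectations ships a built-in `expect_column_to_exist` expectation for this kind of check, but whether the helper wraps it is an assumption.

```python
from typing import Iterable, List


def missing_columns(actual: Iterable[str], expected: Iterable[str]) -> List[str]:
    """Return the expected column names that are absent, in the order given."""
    present = set(actual)
    return [name for name in expected if name not in present]


# Columns of the sample batch shown earlier in this notebook.
columns = ["status_time", "inp_quantity", "cu_content", "outp_quantity", "inp_avg_hardness"]
# Hypothetical list of required sensors; "flow_rate" is made up for illustration.
expected_sensors = ["inp_quantity", "cu_content", "flow_rate"]

print(missing_columns(columns, expected_sensors))
```

An empty result means every expected sensor is present in the dataframe.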

#### Validate the length of the dataframe

In [5]:
create_data_length_expectation(batch, params)

### Column Expectation(s)

#### Validate that a dataset has no null values in a column

In [6]:
create_not_null_expectations_from_tagdict(batch)
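
To make the intent of a not-null expectation concrete, here is a minimal plain-Python sketch that counts missing values per column. It is not the helper's code, which this notebook does not show. The function `null_counts` is hypothetical, and the `None` and NaN entries in the sample rows are made up for illustration. The corresponding built-in Great Expectations expectation is `expect_column_values_to_not_be_null`; that the helper relies on it is an assumption.

```python
import math
from typing import Dict, List, Optional


def is_missing(value: Optional[float]) -> bool:
    """Treat None and NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def null_counts(rows: List[Dict[str, Optional[float]]], columns: List[str]) -> Dict[str, int]:
    """Count missing values per column."""
    return {col: sum(is_missing(row.get(col)) for row in rows) for col in columns}


# Illustrative rows; the missing entries are invented for this sketch.
rows = [
    {"inp_quantity": 147.0, "cu_content": 0.079816},
    {"inp_quantity": None, "cu_content": 0.079816},
    {"inp_quantity": 251.0, "cu_content": float("nan")},
]
counts = null_counts(rows, ["inp_quantity", "cu_content"])
print(counts)
```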

#### Validate the schema of a dataframe with predefined key-value pairs

In [7]:
create_data_schema_expectation(batch, params)
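
One way to read "predefined pairs" is a mapping from column name to expected type. The sketch below works under that assumption and checks a single row against such a mapping. The function `schema_mismatches`, the `schema` dictionary, and the string value for `outp_quantity` are all hypothetical; how the helper actually reads the schema from `params` is not shown here.

```python
from typing import Dict, List


def schema_mismatches(row: dict, schema: Dict[str, type]) -> List[str]:
    """Describe every column that is missing or has an unexpected type."""
    problems = []
    for column, expected_type in schema.items():
        if column not in row:
            problems.append(f"{column}: missing")
        elif not isinstance(row[column], expected_type):
            actual = type(row[column]).__name__
            problems.append(f"{column}: expected {expected_type.__name__}, got {actual}")
    return problems


# Hypothetical schema: column name -> expected Python type.
schema = {"status_time": str, "inp_quantity": float, "outp_quantity": int}
# The string "159" is deliberately wrong to show a mismatch.
row = {"status_time": "2020-07-27 03:59:58", "inp_quantity": 147.0, "outp_quantity": "159"}

print(schema_mismatches(row, schema))
```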

#### Validate the timestamp column of the dataframe and ensure it conforms to the format provided

In [8]:
create_time_format_expectation(batch, params)
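
A timestamp-format check can be sketched with the standard library alone: try to parse each value with the expected format and collect the failures. The format string below matches the `status_time` values in the sample output; the function `invalid_timestamps` and the third, malformed value are hypothetical. Great Expectations provides `expect_column_values_to_match_strftime_format` for this purpose, though whether the helper uses it is an assumption.

```python
from datetime import datetime
from typing import List


def invalid_timestamps(values: List[str], fmt: str) -> List[str]:
    """Return the values that cannot be parsed with the given strftime format."""
    bad = []
    for value in values:
        try:
            datetime.strptime(value, fmt)
        except ValueError:
            bad.append(value)
    return bad


fmt = "%Y-%m-%d %H:%M:%S"
# The last value is invented to show a failing case.
values = ["2020-07-27 03:59:58", "2020-07-27 04:14:59", "27/07/2020 04:29"]

print(invalid_timestamps(values, fmt))
```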

#### Validate the value range of a dataset based on expected values defined in the TagDict

In [9]:
# load tag dictionary
td = catalog.load('td')
create_range_expectations_from_tagdict(batch, td)

2020-08-17 13:35:32,157 - kedro.io.data_catalog - INFO - Loading data from `td` (TagDictCSVLocalDataSet)...
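
The layout of the tag dictionary is not shown in this notebook, so the sketch below assumes it boils down to a lower and an upper bound per tag. The `bounds` values are invented; the data values come from the sample output above. The built-in Great Expectations counterpart is `expect_column_values_to_be_between`, and that the helper builds on it is an assumption.

```python
from typing import Dict, List, Tuple


def out_of_range(values: List[float], low: float, high: float) -> List[float]:
    """Return the values that fall outside the inclusive range [low, high]."""
    return [v for v in values if not (low <= v <= high)]


# Hypothetical bounds per tag; real bounds would come from the tag dictionary.
bounds: Dict[str, Tuple[float, float]] = {
    "inp_quantity": (100.0, 300.0),
    "inp_avg_hardness": (0.0, 0.5),
}
data = {
    "inp_quantity": [147.0, 230.0, 251.0, 250.0, 240.0],
    "inp_avg_hardness": [0.478026, 0.503276, 0.506612, 0.518337, 0.489977],
}

violations = {col: out_of_range(data[col], *bounds[col]) for col in bounds}
print(violations)
```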


#### Validate that sensor pairs have the same values

In [10]:
create_sensor_pair_equals_expectation(batch, params)
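
As an illustration of a pairwise comparison, the sketch below lists the row indices where two series disagree. Treating `inp_quantity` and `outp_quantity` as a pair and allowing a `tolerance` are assumptions made for this example only; which pairs the helper reads from `params`, and whether it allows any tolerance, is not shown. Great Expectations has a built-in `expect_column_pair_values_to_be_equal` expectation for exact equality.

```python
from typing import List, Sequence


def mismatched_rows(first: Sequence[float], second: Sequence[float], tolerance: float = 0.0) -> List[int]:
    """Return the row indices where the two series differ by more than `tolerance`."""
    return [i for i, (a, b) in enumerate(zip(first, second)) if abs(a - b) > tolerance]


# Values from the sample output; pairing these two columns is hypothetical.
inp_quantity = [147.0, 230.0, 251.0, 250.0, 240.0]
outp_quantity = [159, 238, 246, 251, 242]

print(mismatched_rows(inp_quantity, outp_quantity))                 # exact equality
print(mismatched_rows(inp_quantity, outp_quantity, tolerance=5.0))  # with a tolerance
```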

#### Validate that sensor values do not violate flatline rules (a flatline means no change in the data within a process period)

In [11]:
create_flatline_expectation(batch, params)
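
A flatline can be understood as a run of identical consecutive readings that lasts too long. The sketch below finds such runs in plain Python. It illustrates the idea only; the helper's actual rule and its parameters are not shown in this notebook. The function `flatline_runs` and the `min_length` threshold are hypothetical, and the values are the `cu_content` readings from the sample output.

```python
from typing import List, Sequence, Tuple


def flatline_runs(values: Sequence[float], min_length: int) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) index pairs of runs of identical
    consecutive values that are at least `min_length` long."""
    runs = []
    start = 0
    for i in range(1, len(values) + 1):
        # A run ends at the end of the data or when the value changes.
        if i == len(values) or values[i] != values[start]:
            if i - start >= min_length:
                runs.append((start, i - 1))
            start = i
    return runs


cu_content = [0.079816, 0.079816, 0.079816, 0.079816, 0.079683]

print(flatline_runs(cu_content, min_length=4))
```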

#### Validate that sensor values are not flagged by quantile anomaly detection

In [12]:
validate_column_quantile_anomaly(batch, params)
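
The notebook does not say which quantile rule `validate_column_quantile_anomaly` applies. As one common possibility, swapped in here purely for illustration, the sketch below uses the interquartile-range rule: values beyond the first or third quartile by more than `k` times the IQR are flagged. The function `quantile_anomalies`, the factor `k`, and the `new_values` list are hypothetical; the reference values are the `inp_quantity` readings from the sample output.

```python
from statistics import quantiles
from typing import List, Sequence


def quantile_anomalies(reference: Sequence[float], values: Sequence[float], k: float = 1.5) -> List[float]:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR], with quartiles taken from `reference`."""
    q1, _, q3 = quantiles(reference, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


reference = [147.0, 230.0, 251.0, 250.0, 240.0]
# Invented readings: one typical, one very low, one very high.
new_values = [245.0, 90.0, 400.0]

print(quantile_anomalies(reference, new_values))
```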

#### Validate that sensor values are not flagged by level shift anomaly detection

In [13]:
create_level_shift_expectation(batch, params)

2020-08-17 13:35:37,340 - numexpr.utils - INFO - Note: NumExpr detected 12 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2020-08-17 13:35:37,341 - numexpr.utils - INFO - NumExpr defaulting to 8 threads.
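
A level shift is a sudden, lasting change in the baseline of a signal. One simple way to detect it, offered here as an assumption rather than as the helper's method, is to compare the mean of a window before each point with the mean of a window after it. The function `level_shifts`, the `window` and `threshold` values, and the synthetic `signal` are all hypothetical.

```python
from statistics import mean
from typing import List, Sequence


def level_shifts(values: Sequence[float], window: int, threshold: float) -> List[int]:
    """Return indices i where the mean of values[i:i+window] differs from
    the mean of values[i-window:i] by more than `threshold`."""
    shifts = []
    for i in range(window, len(values) - window + 1):
        before = mean(values[i - window:i])
        after = mean(values[i:i + window])
        if abs(after - before) > threshold:
            shifts.append(i)
    return shifts


# Synthetic signal whose baseline jumps from about 10 to about 20 at index 4.
signal = [10.0, 10.2, 9.9, 10.1, 20.0, 20.1, 19.8, 20.2]

print(level_shifts(signal, window=3, threshold=8.0))
```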


In [14]:
validate_column_persist_anomaly(batch, params)

In [15]:
validate_multi_dimension_cluster_anomaly(batch, params)




## Save Your Expectations

Let's save the expectation suite as a JSON file in the `great_expectations/expectations` directory of your project.
If you decide not to save some expectations that you created, use the [remove_expectation method](https://docs.greatexpectations.io/en/latest/module_docs/data_asset_module.html?highlight=remove_expectation&utm_source=notebook&utm_medium=edit_expectations#great_expectations.data_asset.data_asset.DataAsset.remove_expectation).


In [16]:
batch.save_expectation_suite(discard_failed_expectations=False)

2020-08-17T13:35:41-0500 - INFO - 	37 expectation(s) included in expectation_suite. result_format settings filtered.
2020-08-17 13:35:41,525 - great_expectations.data_asset.data_asset - INFO - 	37 expectation(s) included in expectation_suite. result_format settings filtered.
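
For orientation, the sketch below builds a small dictionary in the general shape of a saved suite and serialises it to JSON. It assumes the suite layout used by Great Expectations releases of this period (a suite name plus a list of entries with `expectation_type` and `kwargs`); the single expectation shown is illustrative, not one of the 37 saved above, and the real file may contain additional keys.

```python
import json

# Illustrative suite content; assumed layout, not copied from a real file.
suite_sketch = {
    "expectation_suite_name": "in_out_recent",
    "expectations": [
        {
            "expectation_type": "expect_column_values_to_not_be_null",
            "kwargs": {"column": "inp_quantity"},
        }
    ],
    "meta": {},
}

suite_json = json.dumps(suite_sketch, indent=2)
print(suite_json)
```

Because the suite is plain JSON, it can be reviewed in a diff and kept under version control alongside the pipeline code.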



## Review your Expectations (optional)

Let's now run the validation operators against your expectation suite and rebuild your Data Docs, which helps you communicate about your data with both machines and humans.


In [17]:
run_id = datetime.utcnow().strftime("%Y%m%dT%H%M%S.%fZ-kedro-ge-edit")

results = data_context.run_validation_operator("action_list_operator", assets_to_validate=[batch], run_id=run_id)
expectation_suite_identifier = list(results["details"].keys())[0]
validation_result_identifier = ValidationResultIdentifier(
    expectation_suite_identifier=expectation_suite_identifier,
    batch_identifier=batch.batch_kwargs.to_id(),
    run_id=run_id
)
data_context.build_data_docs()
data_context.open_data_docs(validation_result_identifier)


2020-08-17T13:35:43-0500 - INFO - 	37 expectation(s) included in expectation_suite.
2020-08-17 13:35:43,861 - great_expectations.data_asset.data_asset - INFO - 	37 expectation(s) included in expectation_suite.
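
The `run_id` built in the cell above is a timestamp followed by a fixed suffix. The short sketch below applies the same format string to a fixed, arbitrary timestamp so the resulting identifier is easy to inspect; the variable names `fmt`, `moment` and `example_run_id` are introduced here for illustration.

```python
from datetime import datetime

# Same format string as in the cell above.
fmt = "%Y%m%dT%H%M%S.%fZ-kedro-ge-edit"

# Fixed timestamp chosen for illustration only.
moment = datetime(2020, 8, 17, 13, 35, 43, 861000)
example_run_id = moment.strftime(fmt)

print(example_run_id)  # 20200817T133543.861000Z-kedro-ge-edit
```

Because the microseconds are included, two runs started in quick succession still get distinct identifiers.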

