# Database data validation

In this notebook, we will see how we can use the great_expectations package to validate data in our database.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NatanMish/data_validation/blob/main/notebooks/database_data_validation.ipynb)

#### Install the required packages and import them to the notebook


In [None]:
!pip install -U great_expectations pandas

In [None]:
# import the required packages
import great_expectations as ge
from ruamel import yaml
import pandas as pd

![Great Expectations](https://docs.greatexpectations.io/img/great-expectations-long-logo.svg)

Great Expectations (GE) is a shared, open-source package for data quality. It helps eliminate pipeline debt through data testing, documentation, and profiling, and is aimed at data scientists, data engineers, and data analysts who need to validate data. GE has many useful integrations and can be connected directly to SQL databases, Apache Spark, Apache Airflow, BigQuery, and more. In this tutorial, we will validate data hosted on a local file system, but the process for cloud storage such as a data lake, Azure Blob Storage, a GCP bucket, or AWS S3 is almost identical.

**Terminology**
1. *Data Context* - The primary entry point for a Great Expectations deployment, with configurations and methods for all supporting components.

2. *Data Source* - Provides a standard API for accessing and interacting with data from a wide variety of source systems.

3. *Data Asset* - A collection of records within a Datasource which is usually named based on the underlying data system and sliced to correspond to a desired specification.

4. *Expectation Suite* - A collection of verifiable assertions about data.

5. *Validation* - The act of applying an Expectation Suite to a Batch.

6. *Batch Identifier* - Contains information that uniquely identifies a specific batch from the Data Asset, such as the delivery date or query time.

7. *Data Connector* - Provides the configuration details based on the source data system which are needed by a Datasource to define Data Assets.

### 1. Create a Data Context

We will now create a Data Context, which is the first step in setting up Great Expectations for our project. The easiest way to do this is from the shell, using the `great_expectations` CLI. The command below initializes a new Data Context in the current directory; piping `echo y` into it answers the interactive confirmation prompt automatically. Afterwards, you will see a new directory called `great_expectations` in your current directory.

In [None]:
!echo y | great_expectations init

After running the init command, your great_expectations directory will contain all the important components of a local Great Expectations deployment. This is what the directory structure looks like:

- `great_expectations.yml` contains the main configuration of your deployment.

- The `expectations/` directory stores all your Expectations as JSON files. If you want to store them somewhere else, you can change that later.

- The `plugins/` directory holds code for any custom plugins you develop as part of your deployment.

- The `uncommitted/` directory contains files that shouldn't live in version control. It has a `.gitignore` configured to exclude all its contents from version control. The main contents of the directory are:
    1. `uncommitted/config_variables.yml`, which holds sensitive information, such as database credentials and other secrets.
    2. `uncommitted/data_docs`, which contains Data Docs generated from Expectations, Validation Results, and other metadata.
    3. `uncommitted/validations`, which holds Validation Results generated by Great Expectations.
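
A quick way to confirm that the init command produced the layout described above is to check for the expected paths. The following is a minimal sketch using only the standard library. The helper name `missing_ge_paths` is hypothetical, and the list of paths is taken from the description above rather than from the library itself.

```python
from pathlib import Path

# Top-level entries described above, relative to the great_expectations directory.
EXPECTED_PATHS = [
    "great_expectations.yml",
    "expectations",
    "plugins",
    "uncommitted",
]


def missing_ge_paths(root="great_expectations"):
    """Return the expected entries that do not exist under `root`."""
    root = Path(root)
    return [entry for entry in EXPECTED_PATHS if not (root / entry).exists()]


# An empty list means every expected entry was found.
print(missing_ge_paths())
```

If the list is not empty, the init command most likely ran in a different working directory.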

<div>
<img src="https://docs.greatexpectations.io/assets/images/data_context_does_for_you-df2eca32d0152ead16cccd5d3d226abb.png" width="1000"/>
</div>

### 2. Create a Data Source

In [None]:
# We will start by reading in the GE data context we have created in the previous step
context = ge.get_context()

Now we will build the configuration for a data source as a Python dictionary, which we will later serialize to YAML in order to test it. We will need the following configuration parameters:

In [None]:
datasource_name = "house_prices"
# Data Source - Provides a standard API for accessing and interacting with data from a wide variety of source systems.

In [None]:
execution_engine = "PandasExecutionEngine"  # alternatively we can use SparkExecutionEngine for PySpark oriented
# projects or SqlAlchemyExecutionEngine for creating a SQL database data source.
data_directory = "data"

In [None]:
data_asset_name = f"{datasource_name}_survey_2006"
# Data Asset - A collection of records within a Datasource which is usually named based on the underlying data system and sliced to correspond to a desired specification.

In [None]:
runtime_data_connector_name = "runtime_batch_files_connector"
# Data Connector - Provides the configuration details based on the source data system which are needed by a Datasource to define Data Assets.

In [None]:
datasource_config = {
    "name": datasource_name,
    "class_name": "Datasource",
    "module_name": "great_expectations.datasource",
    "execution_engine": {
        "module_name": "great_expectations.execution_engine",
        "class_name": execution_engine,
    },
    "data_connectors": {
        runtime_data_connector_name: {
            "class_name": "RuntimeDataConnector",
            "batch_identifiers": ["default_identifier_name"],
        },
    },
}
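
The separate parameter cells above can also be folded into a single helper, which makes it easier to define a second datasource later. This is only a sketch: `build_runtime_datasource_config` is a hypothetical name, not part of Great Expectations, and it simply reproduces the dictionary shown above from a few arguments.

```python
def build_runtime_datasource_config(
    name,
    execution_engine="PandasExecutionEngine",
    connector_name="runtime_batch_files_connector",
    batch_identifiers=("default_identifier_name",),
):
    """Build the same datasource dictionary as above from a few parameters."""
    return {
        "name": name,
        "class_name": "Datasource",
        "module_name": "great_expectations.datasource",
        "execution_engine": {
            "module_name": "great_expectations.execution_engine",
            "class_name": execution_engine,
        },
        "data_connectors": {
            connector_name: {
                "class_name": "RuntimeDataConnector",
                "batch_identifiers": list(batch_identifiers),
            },
        },
    }


example_config = build_runtime_datasource_config("house_prices")
print(list(example_config["data_connectors"]))
```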

In [None]:
# Test that the configuration is valid
context.test_yaml_config(yaml.dump(datasource_config))

In [None]:
# If the configuration is valid, we can create the datasource
context.add_datasource(**datasource_config)

In [None]:
# Now we can see that the datasource was created.
context.list_datasources()

### 3. Create an Expectation Suite
Expectations are the core of Great Expectations. They are the assertions that are used to validate data. Let's create an expectation suite, which is a collection of expectations. The diagram below shows where good expectations for our data can come from.

<div>
<img src="https://docs.greatexpectations.io/assets/images/where_expectations_come_from-b3504cf51ad304c8e4a73677a0e73156.png" width="1000"/>
</div>

We will create expectations while exploring the data in the notebook. The method below behaves exactly like `pandas.read_csv`. Similarly wrapped versions of other pandas readers (`read_excel`, `read_table`, `read_parquet`, `read_pickle`, `read_json`, etc.) are also available.

In [None]:
home_data = ge.read_csv("https://github.com/NatanMish/data_validation/blob/a77b247b25c6622ce0c8f8cbc505228161c31a3c/data/train.csv?raw=true")

In [None]:
# The home_data variable is a pandas dataframe with all the methods and properties we know and love. We can use the `head` method to see the
# first few rows of the data.
home_data.head()

In [None]:
# Beyond the pandas methods and properties, we can use GE's expectation methods to define expectations.
# In Jupyter, type `home_data.expect` and press Tab to see the list of available expectations,
# or list them programmatically:
[name for name in dir(home_data) if name.startswith("expect_")]

In [None]:
# Let's create a few example expectations and see if they are valid on this dataset.
home_data.expect_column_to_exist("Id")

Notice the `"success": true` key in the result dictionary. It means the expectation holds for this dataset.

In [None]:
home_data.expect_column_values_to_be_unique("Id")

The expectation above checked the contents of the column, so the result includes a few other useful metrics, such as how many
rows were inspected and how many values were missing.
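
To make those metrics more concrete, here is a plain-Python sketch of what a column uniqueness check computes. The field names mirror the ones in the output above, but the logic is a simplification written for illustration and is not the library's implementation. The helper name `check_values_unique` and the sample values are made up.

```python
from collections import Counter


def check_values_unique(values):
    """Plain-Python illustration of a column uniqueness check."""
    non_missing = [v for v in values if v is not None]
    counts = Counter(non_missing)
    # In this sketch, every occurrence of a repeated value counts as unexpected.
    unexpected = [v for v in non_missing if counts[v] > 1]
    return {
        "success": not unexpected,
        "result": {
            "element_count": len(values),
            "missing_count": len(values) - len(non_missing),
            "unexpected_count": len(unexpected),
        },
    }


print(check_values_unique([1, 2, 3, None]))  # unique, with one missing value
print(check_values_unique([1, 2, 2, 3]))     # contains a repeated value
```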

In [None]:
# This expectation should fail. Let's see what happens:
home_data.expect_column_max_to_be_between("SalePrice", 0, 100000)

The returned dictionary shows that the expectation does not hold, and it reports the observed value that falls outside the expected range.
Here are a few more useful expectation definitions:

In [None]:
home_data.expect_column_distinct_values_to_be_in_set("MSZoning", ["C (all)", "FV", "RH", "RL", "RM"])

In [None]:
home_data.expect_column_mean_to_be_between("GrLivArea", 0, 10000)
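
Aggregate expectations such as `expect_column_max_to_be_between` and `expect_column_mean_to_be_between` compute a single statistic for the column and compare it with the given bounds. As a rough plain-Python sketch of that idea (the helper `check_statistic_between` is hypothetical, the sample prices are made up, and inclusive bounds are assumed):

```python
from statistics import mean


def check_statistic_between(values, statistic, min_value, max_value):
    """Compute one statistic over `values` and test it against inclusive bounds."""
    observed = statistic(values)
    return {
        "success": min_value <= observed <= max_value,
        "result": {"observed_value": observed},
    }


sample_prices = [120000, 95000, 310000]  # illustrative values, not from the dataset

# The maximum exceeds the upper bound, so this check fails.
print(check_statistic_between(sample_prices, max, 0, 100000))
# The mean lies inside the bounds, so this check passes.
print(check_statistic_between(sample_prices, mean, 0, 1000000))
```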

In [None]:
# This will create an expectation suite from all the valid expectations we created above.
home_data.get_expectation_suite()
# If we want the non-valid expectations as well, we can call the `get_expectation_suite` method with the
# `discard_failed_expectations` parameter set to False. If there are any duplicate expectations in the suite,
# the duplicates will be discarded:
# home_data.get_expectation_suite(discard_failed_expectations=False)
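
The effect of the `discard_failed_expectations` flag can be pictured as a simple filter: by default, expectations that failed when they were run are left out of the suite. The sketch below illustrates this with plain dictionaries. `build_suite` and the `recorded` list are hypothetical stand-ins, not GE objects.

```python
def build_suite(recorded, discard_failed_expectations=True):
    """Keep each recorded expectation unless it failed and failures are discarded."""
    return [
        entry["expectation_type"]
        for entry in recorded
        if entry["success"] or not discard_failed_expectations
    ]


# Hand-written record of expectations and whether each one passed.
recorded = [
    {"expectation_type": "expect_column_to_exist", "success": True},
    {"expectation_type": "expect_column_values_to_be_unique", "success": True},
    {"expectation_type": "expect_column_max_to_be_between", "success": False},
]

print(build_suite(recorded))                                     # failed one dropped
print(build_suite(recorded, discard_failed_expectations=False))  # all three kept
```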

In [None]:
# This line will save the expectation suite to the data context
context.save_expectation_suite(home_data.get_expectation_suite(), "my_expectations")

### Exercise 1
Check the following expectations to see if they are valid on the home_data dataframe:

(Not all the expectations were included in the examples above. You can find more expectations in the [expectations directory](https://greatexpectations.io/expectations).)
1. `Street` column should be a string.
2. `LandContour` column cannot be null.
3. `YearBuilt` minimum value should be between 1700 and 1900.
4. `LotArea` median value should be between 5000 and 15000.
5. The most common values in `SaleType` must be either `WD` or `New`.

In [None]:
# your answers here:
# home_data.expect_column_
# home_data.expect_column_
# home_data.expect_column_
# home_data.expect_column_
# home_data.expect_column_

*Exercise solutions can be found in the exercise solutions file in the current directory.*

### 4. Validate the Data
We will now validate the test data using the expectations we have created for the train data.

In [None]:
checkpoint_name = "data_batch_appended"
# Checkpoint - The primary means for validating data in a production deployment of Great Expectations.

In [None]:
checkpoint_config = {
    "name": checkpoint_name,
    "config_version": 1,
    "class_name": "SimpleCheckpoint",
    "validations": [
        {
            "batch_request": {
                "datasource_name": datasource_name,
                "data_connector_name": runtime_data_connector_name,
                "data_asset_name": data_asset_name,
            },
            "expectation_suite_name": "my_expectations",
        }
    ],
}
context.add_checkpoint(**checkpoint_config)

The dictionary returned by the `add_checkpoint` method shows which actions are performed every time the checkpoint runs:
1. Store validation result.
2. Store evaluation parameters.
3. Update data docs. (we will look at the data docs later in this notebook)
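
Conceptually, a checkpoint validates a batch and then hands the result to each configured action in turn. The plain-Python sketch below illustrates that flow only. All function names are hypothetical stand-ins for the three actions listed above and do not correspond to GE classes.

```python
def run_checkpoint_sketch(validate, actions):
    """Validate once, then pass the result to every action in order."""
    result = validate()
    log = [action(result) for action in actions]
    return result, log


def store_validation_result(result):
    return f"stored validation result (success={result['success']})"


def store_evaluation_parameters(result):
    return "stored evaluation parameters"


def update_data_docs(result):
    return "updated data docs"


result, log = run_checkpoint_sketch(
    validate=lambda: {"success": True},  # placeholder for the real validation
    actions=[store_validation_result, store_evaluation_parameters, update_data_docs],
)
print(log)
```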

In [None]:
home_data_test = pd.read_csv("https://github.com/NatanMish/data_validation/blob/a77b247b25c6622ce0c8f8cbc505228161c31a3c/data/test.csv?raw=true")

In [None]:
results = context.run_checkpoint(
    checkpoint_name=checkpoint_name,
    batch_request={
        "runtime_parameters": {"batch_data": home_data_test},
        "batch_identifiers": {
            "default_identifier_name": "default_identifier_name"
        },
    },
)
# Batch Identifier - contains information that uniquely identifies a specific batch from the Data Asset, such as the delivery date or query time.
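
Because a batch identifier is meant to distinguish one batch from another, a more descriptive value than the placeholder string is often useful, for example the date the data was delivered. The sketch below builds the same `batch_request` dictionary as above with such a value. `build_runtime_batch_request` is a hypothetical helper, the date is made up, and `batch_data=None` is only a placeholder for a real DataFrame.

```python
from datetime import date


def build_runtime_batch_request(batch_data, delivery_date):
    """Build the `batch_request` dictionary used above, tagged with a delivery date."""
    return {
        "runtime_parameters": {"batch_data": batch_data},
        "batch_identifiers": {
            # The key must match an identifier name declared in the data connector.
            "default_identifier_name": delivery_date.isoformat()
        },
    }


# batch_data would normally be a DataFrame such as home_data_test.
request = build_runtime_batch_request(batch_data=None, delivery_date=date(2006, 1, 31))
print(request["batch_identifiers"])
```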

In [None]:
# Let's take a look at the validation result object we got:
run_identifier = next(iter(results['run_results']))
results['run_results'][run_identifier]['validation_result']['statistics']
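
The statistics dictionary is compact enough to turn into a one-line summary, which can be handy in logs. A minimal sketch, assuming the dictionary contains the keys `evaluated_expectations`, `successful_expectations`, and `success_percent`. The helper name and the example values are illustrative and are not actual output from this notebook.

```python
def summarize_statistics(stats):
    """Format a validation statistics dictionary as a short summary line."""
    return (
        f"{stats['successful_expectations']}/{stats['evaluated_expectations']} "
        f"expectations passed ({stats['success_percent']:.1f}%)"
    )


example_stats = {  # illustrative values only
    "evaluated_expectations": 4,
    "successful_expectations": 3,
    "unsuccessful_expectations": 1,
    "success_percent": 75.0,
}
print(summarize_statistics(example_stats))
```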

In [None]:
# Here is an example of one of the validations on one of the expectations. The check has passed and there are some 
# useful extra details too.
results['run_results'][run_identifier]['validation_result']['results'][2]

#### What does a checkpoint run on invalid data look like?
Glad you asked. Let's inject a duplicate value into our `Id` column to see how it behaves:

In [None]:
# This will create a duplicate id value for two separate records
home_data_test.at[0, 'Id'] = 1462

In [None]:
bad_data_checkpoint_name = "my_bad_data_checkpoint"
bad_data_checkpoint_config = {
    "name": bad_data_checkpoint_name,
    "config_version": 1,
    "class_name": "SimpleCheckpoint",
    "validations": [
        {
            "batch_request": {
                "datasource_name": datasource_name,
                "data_connector_name": runtime_data_connector_name,
                "data_asset_name": "batch_data_asset",
            },
            "expectation_suite_name": "my_expectations",
        }
    ],
}
context.add_checkpoint(**bad_data_checkpoint_config)

In [None]:
results_bad_data_checkpoint = context.run_checkpoint(
    checkpoint_name=bad_data_checkpoint_name,
    batch_request={
        "runtime_parameters": {"batch_data": home_data_test},
        "batch_identifiers": {
            "default_identifier_name": "default_identifier_name"
        },
    },
)

In [None]:
# As expected, not all expectations were successful.
bad_data_run_identifier = next(iter(results_bad_data_checkpoint['run_results']))
results_bad_data_checkpoint['run_results'][bad_data_run_identifier]['validation_result']['statistics']

In [None]:
# And here is the summary for the failed expectation
results_bad_data_checkpoint['run_results'][bad_data_run_identifier]['validation_result']['results'][1]
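
Rather than indexing into `results` by position, the failed expectations can be collected by filtering on the `success` flag. The sketch below works on plain dictionaries and assumes each entry has `success` and an `expectation_config` with `expectation_type` and `kwargs`. A validation result may first need converting to plain dictionaries, for example with its `to_json_dict()` method. The helper name and the example structure are hand-written for illustration.

```python
def failed_expectations(validation_result):
    """Return (expectation_type, kwargs) for every failed entry in `results`."""
    return [
        (
            entry["expectation_config"]["expectation_type"],
            entry["expectation_config"]["kwargs"],
        )
        for entry in validation_result["results"]
        if not entry["success"]
    ]


example_result = {  # illustrative structure, not actual output
    "results": [
        {
            "success": True,
            "expectation_config": {
                "expectation_type": "expect_column_to_exist",
                "kwargs": {"column": "Id"},
            },
        },
        {
            "success": False,
            "expectation_config": {
                "expectation_type": "expect_column_values_to_be_unique",
                "kwargs": {"column": "Id"},
            },
        },
    ]
}
print(failed_expectations(example_result))
```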