## Automate Data Quality Checks with Great Expectations
**Introduction**: In this activity, you will learn how to automate data quality checks using the Great Expectations framework. This includes setting up expectations, validating datasets, generating validation reports, and scheduling recurring checks.

### Task 1: Setup and Initial Expectations

1. Objective: Set up Great Expectations and create initial expectations for a dataset.
2. Steps:
    - Install Great Expectations using pip.
    - Initialize a data context.
    - Create basic expectations on a sample dataset.
    - E.g., implement a basic setup with expectations for column presence and type.
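
Before turning to the framework, it may help to see what column-presence and type expectations boil down to. The following is a minimal, dependency-free sketch: the helpers `check_column_exists` and `check_column_type` are hypothetical names invented for illustration and are not part of Great Expectations, and the result dictionaries only loosely imitate a validation result.

```python
# Hypothetical helpers for illustration only; these are NOT Great Expectations functions.
rows = [
    {"Name": "Alice", "Age": 25, "Email": "alice@example.com"},
    {"Name": "Bob", "Age": 30, "Email": "bob@example.com"},
]


def check_column_exists(rows, column):
    """Pass only if every row carries the column."""
    missing = [i for i, row in enumerate(rows) if column not in row]
    return {"success": not missing, "unexpected_rows": missing}


def check_column_type(rows, column, expected_type):
    """Pass only if every value in the column is an instance of expected_type."""
    bad = [row.get(column) for row in rows
           if not isinstance(row.get(column), expected_type)]
    return {"success": not bad, "unexpected_values": bad}


presence = check_column_exists(rows, "Name")
age_type = check_column_type(rows, "Age", int)
# Deliberately wrong type, to show what a failing check reports
email_type = check_column_type(rows, "Email", int)

print(presence)
print(age_type)
print(email_type)
```

Each check returns both a pass/fail flag and the offending values, which is the kind of information a validation report needs in order to be actionable.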

In [1]:
# Write your code from here
# The calls below assume the Great Expectations 0.18.x fluent API; an unpinned
# install can pull a release whose API differs, so the version range is pinned.
!pip install "great_expectations>=0.18,<1.0" --quiet

import os
import pandas as pd
import great_expectations as gx

# Step 1: Setup - Initialize a file-backed Data Context (created on first run, reused afterwards)
context_path = "ge_context"
os.makedirs(context_path, exist_ok=True)
context = gx.get_context(project_root_dir=context_path)

# Step 2: Load a sample DataFrame
data = {
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Age": [25, 30, 35, 40],
    "Email": ["alice@example.com", "bob@example.com", "charlie@example.com", "david@example.com"]
}
df = pd.DataFrame(data)

# Step 3: Register a Pandas datasource and a DataFrame asset (using Fluent API)
datasource = context.sources.add_or_update_pandas(name="my_pandas_ds")
asset = datasource.add_dataframe_asset(name="sample_people")
batch_request = asset.build_batch_request(dataframe=df)

# Step 4: Create expectation suite
context.add_or_update_expectation_suite(expectation_suite_name="basic_expectations")

# Step 5: Create a Validator bound to the DataFrame and the suite
validator = context.get_validator(
    batch_request=batch_request,
    expectation_suite_name="basic_expectations"
)

# Step 6: Add basic expectations
validator.expect_column_to_exist("Name")
validator.expect_column_values_to_be_of_type("Age", "int64")
validator.expect_column_values_to_match_regex("Email", r".+@.+\..+")

# Save expectations
validator.save_expectation_suite(discard_failed_expectations=False)

# Step 7: Create and run a simple checkpoint
checkpoint = context.add_or_update_checkpoint(
    name="basic_checkpoint",
    validator=validator
)
checkpoint_result = checkpoint.run()

# View validation outcome
print("Validation succeeded:", checkpoint_result.success)

### Task 2: Validate Datasets and Generate Reports

1. Objective: Validate a dataset against defined expectations and generate a report.
2. Steps:
    - Execute the validation process on the dataset.
    - Review the validation results and generate a report.
    - E.g., validate completeness and consistency expectations, and view the results.
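
As a rough illustration of the two kinds of check named above, the sketch below treats completeness as "no missing values" and consistency as "values drawn from an allowed set", then prints a small text report. It is plain Python under stated assumptions: `check_not_null` and `check_in_set` are hypothetical helper names rather than Great Expectations calls, and the sample rows are made up for the example.

```python
# Hypothetical helper names; a plain-Python illustration, not the Great Expectations API.
records = [
    {"id": 1, "age": 25, "status": "Active"},
    {"id": 2, "age": 30, "status": "Inactive"},
    {"id": 3, "age": None, "status": "Active"},
    {"id": 4, "age": 22, "status": "Active"},
    {"id": 5, "age": 28, "status": None},
]


def check_not_null(records, column):
    """Completeness: collect the ids of rows where the column is missing."""
    null_ids = [r["id"] for r in records if r.get(column) is None]
    return {"check": f"{column} is not null",
            "success": not null_ids,
            "failing_ids": null_ids}


def check_in_set(records, column, allowed):
    """Consistency: collect ids whose non-null value is outside the allowed set.

    Nulls are skipped here so that missing values are reported only by the
    completeness check.
    """
    bad_ids = [r["id"] for r in records
               if r.get(column) is not None and r[column] not in allowed]
    return {"check": f"{column} in {sorted(allowed)}",
            "success": not bad_ids,
            "failing_ids": bad_ids}


report = [
    check_not_null(records, "age"),
    check_not_null(records, "status"),
    check_in_set(records, "status", {"Active", "Inactive"}),
]

# Render a minimal validation report, one line per check
for entry in report:
    outcome = "PASS" if entry["success"] else "FAIL"
    print(f"{outcome}: {entry['check']} (failing ids: {entry['failing_ids']})")
```

Keeping the failing row ids alongside each outcome makes the report useful for follow-up, since a bare pass/fail flag does not say which records need fixing.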


In [2]:
# Assumes the Great Expectations 0.18.x fluent API (>=0.18,<1.0);
# other releases expose different entry points.
import great_expectations as gx
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "age": [25, 30, None, 22, 28],
    "status": ["Active", "Inactive", "Active", "Active", None]
})

context = gx.get_context()

context.add_or_update_expectation_suite(expectation_suite_name="validation_suite")

# Wrap the in-memory DataFrame as an asset on the built-in default pandas datasource
asset = context.sources.pandas_default.add_dataframe_asset(name="runtime_data_asset")
validator = context.get_validator(
    batch_request=asset.build_batch_request(dataframe=df),
    expectation_suite_name="validation_suite"
)

# Completeness and consistency expectations (the null values in df make the first two fail)
validator.expect_column_values_to_not_be_null("age")
validator.expect_column_values_to_not_be_null("status")
validator.expect_column_values_to_be_in_set("status", ["Active", "Inactive"])
validator.save_expectation_suite(discard_failed_expectations=False)

# Run the validation through a checkpoint
checkpoint = context.add_or_update_checkpoint(name="validation_checkpoint", validator=validator)
results = checkpoint.run(run_name="validation_run")

# Build the HTML report (Data Docs) and open it in a browser
context.build_data_docs()
context.open_data_docs()
print(results)

### Task 3: Advanced Expectations and Scheduling

1. Objective: Create advanced expectations for conditional checks and automate the validation.
2. Steps:
    - Define advanced expectations based on complex conditions.
    - Use scheduling tools to automate periodic checks.
    - E.g., define an expectation that customer IDs must be unique, and schedule a daily check.
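
The two ideas in this task can be sketched with the standard library alone: a uniqueness check that reports which IDs repeat, and a helper that works out when the next daily run is due. Both function names (`find_duplicates`, `next_daily_run`) are hypothetical, and the 02:00 run hour is an arbitrary assumption chosen for the example; a real setup would hand the timing to a scheduler or cron.

```python
from collections import Counter
from datetime import datetime, timedelta


def find_duplicates(values):
    """Return the sorted values that appear more than once (empty list if all unique)."""
    counts = Counter(values)
    return sorted(value for value, n in counts.items() if n > 1)


def next_daily_run(now, hour=2):
    """Return the next datetime at `hour`:00 that is strictly after `now`."""
    candidate = now.replace(hour=hour, minute=0, second=0, microsecond=0)
    if candidate <= now:
        candidate += timedelta(days=1)
    return candidate


customer_ids = [101, 102, 103, 104, 105]

# Uniqueness check on clean data, then on data with a repeated ID
duplicates = find_duplicates(customer_ids)
duplicates_with_repeat = find_duplicates(customer_ids + [102])

# A fixed timestamp keeps the example deterministic
run_at = next_daily_run(datetime(2024, 1, 1, 9, 30))

print("Duplicates:", duplicates)
print("Duplicates after adding a repeat:", duplicates_with_repeat)
print("Next daily run at:", run_at)
```

Separating "what to check" from "when to run it" keeps the validation function easy to test on its own, whichever scheduling tool ends up calling it.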

In [3]:
# APScheduler is a separate package and must be installed before it can be imported.
# The 3.x line is assumed here, since it provides BlockingScheduler at this import path.
!pip install "apscheduler<4" --quiet

# Assumes the Great Expectations 0.18.x fluent API (>=0.18,<1.0)
import great_expectations as gx
import pandas as pd
from apscheduler.schedulers.blocking import BlockingScheduler

df = pd.DataFrame({
    "customer_id": [101, 102, 103, 104, 105],
    "purchase_amount": [250.0, 150.0, 400.0, 300.0, 150.0],
    "status": ["active", "active", "inactive", "active", "inactive"]
})

context = gx.get_context()

suite_name = "advanced_expectation_suite"
context.add_or_update_expectation_suite(expectation_suite_name=suite_name)

# Wrap the in-memory DataFrame as an asset on the built-in default pandas datasource
asset = context.sources.pandas_default.add_dataframe_asset(name="runtime_data_asset")
validator = context.get_validator(
    batch_request=asset.build_batch_request(dataframe=df),
    expectation_suite_name=suite_name
)

validator.expect_column_values_to_be_unique("customer_id")
validator.expect_column_values_to_be_between("purchase_amount", min_value=100, max_value=500)
validator.expect_column_values_to_match_regex("status", regex="^(active|inactive)$")
validator.save_expectation_suite(discard_failed_expectations=False)

checkpoint = context.add_or_update_checkpoint(name="advanced_checkpoint", validator=validator)

def run_validation():
    # Re-run the checkpoint and refresh Data Docs on every scheduled tick
    results = checkpoint.run(run_name="scheduled_validation_run")
    context.build_data_docs()
    print(results)

scheduler = BlockingScheduler()
# The first run fires one day after start, then once every day
scheduler.add_job(run_validation, 'interval', days=1)
try:
    scheduler.start()  # Blocks this cell; interrupt the kernel to stop
except (KeyboardInterrupt, SystemExit):
    pass