## Automate Data Quality Checks with Great Expectations
**Introduction**: In this activity, you will learn how to automate data quality checks using the Great Expectations framework. This includes setting up expectations and generating validation reports.

### Task 1: Setup and Initial Expectations

1. Objective: Set up Great Expectations and create initial expectations for a dataset.
2. Steps:
    - Install Great Expectations using pip.
    - Initialize a data context.
    - Create basic expectations on a sample dataset.
    - E.g., implement a basic setup and expectations for column presence and type.
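
To make the last step concrete, the following is a minimal, library-independent sketch of what column presence and type checks verify. It does not use Great Expectations; `check_columns`, `EXPECTED_TYPES`, and the deliberately broken sample rows are hypothetical names and data introduced only for this illustration.

```python
# Hypothetical helper for illustration only; not part of Great Expectations.
EXPECTED_TYPES = {"Name": str, "Age": int, "Email": str}


def check_columns(rows, expected_types):
    """Return a list of human-readable problems found in `rows` (a list of dicts)."""
    problems = []
    for column, expected in expected_types.items():
        for index, row in enumerate(rows):
            if column not in row:
                # Presence check: the column must exist in every row.
                problems.append(f"row {index}: missing column {column!r}")
            elif not isinstance(row[column], expected):
                # Type check: the value must be an instance of the expected type.
                actual = type(row[column]).__name__
                problems.append(
                    f"row {index}: {column!r} is {actual}, expected {expected.__name__}"
                )
    return problems


rows = [
    {"Name": "Alice", "Age": 25, "Email": "alice@example.com"},
    {"Name": "Bob", "Age": "30", "Email": "bob@example.com"},  # Age stored as text
    {"Name": "Charlie", "Age": 35},  # Email column missing
]

problems = check_columns(rows, EXPECTED_TYPES)
for problem in problems:
    print(problem)
```

An expectation suite plays the role of `EXPECTED_TYPES` here: it is a declarative list of rules that can be saved and re-run against new data.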

In [1]:
# Write your code from here
# Note: this cell uses the legacy pandas Dataset API (ge.from_pandas), which
# exists in Great Expectations 0.x releases and was removed in 1.0.
import great_expectations as ge
import pandas as pd

# The unused `DataContext` import was dropped: it raised
# "ImportError: cannot import name 'DataContext' from 'great_expectations.data_context'"
# in this environment.

# Load sample dataset (CSV or create a DataFrame)
data = {
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "Email": ["alice@example.com", "bob@example.com", "charlie@example.com"]
}
df = pd.DataFrame(data)

# Wrap the DataFrame in a Great Expectations dataset
df_ge = ge.from_pandas(df)

# Name the suite and the JSON file it is saved to
expectation_suite_name = "basic_suite"
suite_path = f"{expectation_suite_name}.json"

# Expect columns to exist
for col in ["Name", "Age", "Email"]:
    df_ge.expect_column_to_exist(col)

# Expect columns to be of specific types
df_ge.expect_column_values_to_be_of_type("Name", "str")
df_ge.expect_column_values_to_be_of_type("Age", "int")
df_ge.expect_column_values_to_be_of_type("Email", "str")

# Save the expectation suite to a JSON file; no data context is attached to
# this dataset, so an explicit file path is required
df_ge.save_expectation_suite(filepath=suite_path)

print(f"Saved expectation suite to '{suite_path}' with basic column presence and type checks.")

### Task 2: Validate Datasets and Generate Reports

1. Objective: Validate a dataset against defined expectations and generate a report.
2. Steps:
    - Execute the validation process on the dataset.
    - Review the validation results and generate a report.
    - E.g., validate completeness and consistency expectations, and view the results.
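
As a rough illustration of what completeness and consistency checks report, here is a library-independent sketch. It assumes completeness means "no missing values" and consistency means "every email matches a simple pattern". `summarize_column`, `EMAIL_PATTERN`, and the result keys are hypothetical choices for this sketch, and the pattern is intentionally simplistic rather than a full email validator.

```python
import re

# Simplistic pattern for illustration: something@something.something
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def summarize_column(values, predicate):
    """Collect the values that fail `predicate` and report a pass/fail summary."""
    failures = [value for value in values if not predicate(value)]
    return {
        "success": not failures,
        "unexpected_count": len(failures),
        "unexpected_values": failures,
    }


names = ["Alice", "Bob", "Charlie", None]
emails = ["alice@example.com", "bob@example.com", "charlie@example.com", "invalid_email"]

# Completeness: every name must be present (not None).
completeness = summarize_column(names, lambda value: value is not None)

# Consistency: every email must match the illustrative pattern.
consistency = summarize_column(
    emails, lambda value: EMAIL_PATTERN.match(value) is not None
)

print("completeness:", completeness)
print("consistency:", consistency)
```

A validation report is essentially a collection of such per-rule summaries, which makes it easy to see which rule failed and on which values.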


In [None]:
# Write your code from here
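# Note: this cell uses the legacy pandas Dataset API (ge.from_pandas), which
# exists in Great Expectations 0.x releases and was removed in 1.0.
import great_expectations as ge
import pandas as pd
from great_expectations.render.renderer import ValidationResultsPageRenderer
from great_expectations.render.view import DefaultJinjaPageView

# Load the dataset (can be a CSV or a DataFrame wrapped as GE Dataset)
data = {
    "Name": ["Alice", "Bob", "Charlie", None],
    "Age": [25, 30, 35, 40],
    "Email": ["alice@example.com", "bob@example.com", "charlie@example.com", "invalid_email"]
}
df = pd.DataFrame(data)

df_ge = ge.from_pandas(df)

# validate() treats a string as the path to a suite JSON file, so this
# assumes Task 1 saved the suite as "basic_suite.json"
expectation_suite_name = "basic_suite"
suite_path = f"{expectation_suite_name}.json"

# Validate the dataset against the expectation suite.
# basic_suite only covers column presence and types, so it has no rule that
# targets the missing Name or the malformed Email in this sample.
validation_result = df_ge.validate(expectation_suite=suite_path, result_format="SUMMARY")

print("Validation Result Summary:")
print(validation_result)

# Generate an HTML validation report to visualize the results:
# the renderer builds a document model from a single validation result,
# and the Jinja view turns that model into an HTML string
document_model = ValidationResultsPageRenderer().render(validation_result)
html = DefaultJinjaPageView().render(document_model)

# Save the report as an HTML file
with open("validation_report.html", "w") as f:
    f.write(html)

print("Validation report generated: validation_report.html")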


### Task 3: Advanced Expectations and Scheduling

1. Objective: Create advanced expectations for conditional checks and automate the validation.
2. Steps:
    - Define advanced expectations based on complex conditions.
    - Use scheduling tools to automate periodic checks.
    - E.g., define an expectation that customer IDs must be unique, and schedule a daily check.
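
The two ideas in this task, a uniqueness rule and a daily trigger, can be sketched with the standard library alone. This is an illustrative sketch, not how Great Expectations or any scheduling package works internally; `find_duplicates` and `next_daily_run` are hypothetical helper names, and the fixed timestamp is sample input.

```python
import datetime as dt
from collections import Counter


def find_duplicates(values):
    """Return the values that occur more than once, in first-seen order."""
    counts = Counter(values)
    return [value for value, count in counts.items() if count > 1]


def next_daily_run(now, run_at):
    """Return the next datetime at which a job scheduled daily at `run_at` fires."""
    candidate = dt.datetime.combine(now.date(), run_at)
    if candidate <= now:
        # Today's slot has already passed, so the next run is tomorrow.
        candidate += dt.timedelta(days=1)
    return candidate


customer_ids = [101, 102, 103, 102, 105]
duplicates = find_duplicates(customer_ids)
print("Duplicate customer IDs:", duplicates)

now = dt.datetime(2024, 1, 15, 10, 30)  # sample "current time"
upcoming = next_daily_run(now, dt.time(9, 0))
print("Next validation run:", upcoming)
```

A scheduler loop repeatedly compares the current time with the value returned by something like `next_daily_run` and triggers the validation once that time is reached.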

In [None]:

# Write your code from here
import great_expectations as ge
import pandas as pd
import schedule
import time

def create_expectation_suite():
    # Sample data for creating expectations (could be replaced with real data)
    data = {
        "CustomerID": [101, 102, 103, 102, 105],  # Duplicate ID 102 for testing
        "Name": ["Alice", "Bob", "Charlie", "David", "Eva"],
    }
    df = pd.DataFrame(data)

    df_ge = ge.from_pandas(df)

    # Name the suite and the JSON file it is saved to
    expectation_suite_name = "advanced_suite"
    suite_path = f"{expectation_suite_name}.json"

    # Expect CustomerID to be unique
    df_ge.expect_column_values_to_be_unique("CustomerID")

    # Save the expectation suite. The sample data contains a duplicate ID, so the
    # expectation fails here; discard_failed_expectations=False keeps it in the file.
    df_ge.save_expectation_suite(filepath=suite_path, discard_failed_expectations=False)

    print(f"Expectation suite '{expectation_suite_name}' saved to '{suite_path}' with uniqueness check.")

def run_validation():
    print("Running data quality validation...")

    # Load data for validation (replace with real data source)
    data = {
        "CustomerID": [101, 102, 103, 102, 105],  # Same sample data for testing
        "Name": ["Alice", "Bob", "Charlie", "David", "Eva"],
    }
    df = pd.DataFrame(data)

    df_ge = ge.from_pandas(df)

    # validate() treats a string as the path to the saved suite JSON file
    suite_path = "advanced_suite.json"

    # Validate data against expectation suite
    validation_result = df_ge.validate(expectation_suite=suite_path, result_format="SUMMARY")

    print(validation_result)

    # Optional: Add code here to save results or send alerts

def main():
    # Create expectation suite first (only needs to be done once or when suite changes)
    create_expectation_suite()

    # Schedule daily validation at 9:00 AM
    schedule.every().day.at("09:00").do(run_validation)

    print("Scheduler started. Waiting to run daily validations...")

    # Keep script running to execute scheduled jobs
    while True:
        schedule.run_pending()
        time.sleep(60)  # Check every minute

if __name__ == "__main__":
    main()
