## Automate Data Quality Checks with Great Expectations
**Introduction**: In this activity, you will learn how to automate data quality checks using the Great Expectations framework. This includes setting up expectations, validating a dataset, generating validation reports, and scheduling recurring checks.

### Task 1: Setup and Initial Expectations

1. Objective: Set up Great Expectations and create initial expectations for a dataset.
2. Steps:
    - Install Great Expectations using pip.
    - Initialize a data context.
    - Create basic expectations on a sample dataset.
    - E.g., implement a basic setup and expectations for column presence and type.
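
The Dataset-style API used in the cell below is not available in every Great Expectations release, so it may be worth checking what is installed before running it. The following is a minimal sketch that uses only the standard library; the helper names `installed_version` and `legacy_dataset_api_available` are illustrative and are not part of Great Expectations.

```python
import importlib.util
from importlib import metadata


def installed_version(package="great_expectations"):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def legacy_dataset_api_available():
    """Report whether the `great_expectations.dataset` module can be located."""
    try:
        # find_spec imports the parent package first, so a missing or
        # broken installation surfaces here as an ImportError.
        return importlib.util.find_spec("great_expectations.dataset") is not None
    except (ImportError, ValueError):
        return False


print("great_expectations version:", installed_version())
print("legacy Dataset API available:", legacy_dataset_api_available())
```

If the second line prints `False`, the `PandasDataset` import in the next cell can be expected to fail.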

In [9]:
import great_expectations as ge
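# Legacy Dataset API: this import fails on Great Expectations releases that
# no longer ship `great_expectations.dataset` (see the error output below).
from great_expectations.dataset import PandasDataset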
import pandas as pd
import schedule
import time
import logging

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

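def initialize_data_context():
    try:
        # DataContext() with no arguments looks for an existing
        # great_expectations/ project directory and raises if none is found.
        context = ge.DataContext()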
        logging.info("Great Expectations context initialized.")
        return context
    except Exception as e:
        logging.error(f"Failed to initialize GE context: {e}")
        return None

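def load_sample_data():
    # The sample rows contain a duplicate CustomerID (102), an Age below 18,
    # and a malformed email, so each value-level expectation has a violation to flag.
    data = {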
        "CustomerID": [101, 102, 103, 102],
        "Age": [25, 30, 17, 45],
        "Email": ["a@example.com", "b@example.com", "c@example.com", "invalid-email"]
    }
    try:
        df = pd.DataFrame(data)
        logging.info("Sample data loaded.")
        return df
    except Exception as e:
        logging.error(f"Failed to load data: {e}")
        return None

def define_expectations(ge_df):
    try:
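        # CustomerID: must be present and must not repeat.
        ge_df.expect_column_to_exist("CustomerID")
        ge_df.expect_column_values_to_be_unique("CustomerID")
        # Age: must be present and at least 18 (no upper bound).
        ge_df.expect_column_to_exist("Age")
        ge_df.expect_column_values_to_be_between("Age", min_value=18)
        # Email: must be present and look like name@domain.tld.
        ge_df.expect_column_to_exist("Email")
        ge_df.expect_column_values_to_match_regex("Email", r".+@.+\..+")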
        logging.info("Expectations defined successfully.")
    except Exception as e:
        logging.error(f"Failed to define expectations: {e}")

def run_validation(ge_df):
    try:
        results = ge_df.validate()
        logging.info("Validation completed.")
        return results
    except Exception as e:
        logging.error(f"Validation failed: {e}")
        return None

def scheduled_validation():
    logging.info("Scheduled validation started.")
    df = load_sample_data()
    if df is None:
        logging.error("No data loaded, skipping validation.")
        return
    ge_df = PandasDataset(df)
    define_expectations(ge_df)
    results = run_validation(ge_df)
    if results:
        logging.info(f"Validation summary: {results['statistics']}")
    else:
        logging.error("No validation results to report.")

if __name__ == "__main__":
    context = initialize_data_context()
    if context is None:
        logging.error("Exiting due to failure initializing Great Expectations context.")
        raise SystemExit(1)

    df = load_sample_data()
    if df is not None:
        ge_df = PandasDataset(df)
        define_expectations(ge_df)
        results = run_validation(ge_df)
        if results:
            print("One-time Validation Results:")
            print(results)

    schedule.every().day.at("09:00").do(scheduled_validation)

    logging.info("Scheduler started. Press Ctrl+C to exit.")
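    try:
        while True:
            schedule.run_pending()
            time.sleep(60)
    except KeyboardInterrupt:
        # Ctrl+C ends the loop cleanly instead of surfacing a traceback.
        logging.info("Scheduler stopped.")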


ModuleNotFoundError: No module named 'great_expectations.dataset'
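
The error above indicates that the installed Great Expectations release does not include the legacy `great_expectations.dataset` module, so none of the expectations in the cell were actually evaluated. To show what the three value-level expectations are meant to catch, here is a library-free sketch over the same sample rows. The helper names are hypothetical, and the matching and counting rules are assumptions made for illustration; they are not guaranteed to mirror Great Expectations' own semantics.

```python
import re
from collections import Counter

# Same sample rows as in load_sample_data() above.
rows = {
    "CustomerID": [101, 102, 103, 102],
    "Age": [25, 30, 17, 45],
    "Email": ["a@example.com", "b@example.com", "c@example.com", "invalid-email"],
}


def unexpected_duplicates(values):
    """Return every value that occurs more than once (all occurrences)."""
    counts = Counter(values)
    return [v for v in values if counts[v] > 1]


def unexpected_below(values, min_value):
    """Return values strictly below the inclusive lower bound."""
    return [v for v in values if v < min_value]


def unexpected_non_matching(values, pattern):
    """Return values in which the pattern is not found anywhere."""
    regex = re.compile(pattern)
    return [v for v in values if regex.search(v) is None]


checks = {
    "CustomerID is unique": unexpected_duplicates(rows["CustomerID"]),
    "Age >= 18": unexpected_below(rows["Age"], 18),
    "Email matches pattern": unexpected_non_matching(rows["Email"], r".+@.+\..+"),
}

for name, unexpected in checks.items():
    status = "PASS" if not unexpected else "FAIL"
    print(f"{status}  {name}  unexpected={unexpected}")
```

All three checks fail on this data, which is the point of the sample: each column carries exactly one kind of violation.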

### Task 2: Validate Datasets and Generate Reports

1. Objective: Validate a dataset against defined expectations and generate a report.
2. Steps:
    - Execute the validation process on the dataset.
    - Review the validation results and generate a report.
    - E.g., validate completeness and consistency expectations, and view the results.


AttributeError: module 'great_expectations' has no attribute 'from_pandas'
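
The code cell for this task is not shown; only its error output remains, and it suggests the cell called `ge.from_pandas`, which the installed release does not provide. As a rough illustration of the validate-then-report flow, the sketch below runs one completeness check (no missing values) and one consistency check (uniform type) and summarizes them. Everything here is hypothetical: the `CheckResult` class, the helper functions, and the sample rows are invented for the example, and the summary keys are chosen to resemble the `results['statistics']` dictionary logged in Task 1.

```python
from dataclasses import dataclass


@dataclass
class CheckResult:
    expectation: str
    column: str
    success: bool
    unexpected_count: int


def check_not_null(column, values):
    """Completeness: no value in the column may be None."""
    missing = sum(1 for v in values if v is None)
    return CheckResult("values_to_not_be_null", column, missing == 0, missing)


def check_of_type(column, values, expected_type):
    """Consistency: every non-missing value must be of the expected type."""
    wrong = sum(
        1 for v in values if v is not None and not isinstance(v, expected_type)
    )
    name = f"values_to_be_of_type[{expected_type.__name__}]"
    return CheckResult(name, column, wrong == 0, wrong)


def summarize(results):
    """Aggregate individual results into summary statistics."""
    evaluated = len(results)
    successful = sum(1 for r in results if r.success)
    return {
        "evaluated_expectations": evaluated,
        "successful_expectations": successful,
        "unsuccessful_expectations": evaluated - successful,
        "success_percent": 100.0 * successful / evaluated if evaluated else None,
    }


def render_report(results):
    """Build a plain-text report: one line per check plus a total."""
    lines = [
        f"{'PASS' if r.success else 'FAIL'}  {r.column}: {r.expectation} "
        f"(unexpected={r.unexpected_count})"
        for r in results
    ]
    stats = summarize(results)
    lines.append(
        f"{stats['successful_expectations']}/{stats['evaluated_expectations']} "
        "expectations met"
    )
    return "\n".join(lines)


# Hypothetical rows with one missing ID and one Age stored as a string.
data = {"CustomerID": [101, 102, None, 104], "Age": [25, "30", 41, 45]}
results = [
    check_not_null("CustomerID", data["CustomerID"]),
    check_not_null("Age", data["Age"]),
    check_of_type("Age", data["Age"], int),
]
print(render_report(results))
```

Keeping per-check results separate from the summary makes it straightforward to write the same data to a log line, a text file, or an HTML page later.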

### Task 3: Advanced Expectations and Scheduling

1. Objective: Create advanced expectations for conditional checks and automate the validation.
2. Steps:
    - Define advanced expectations based on complex conditions.
    - Use scheduling tools to automate periodic checks.
    - E.g., define an expectation that customer IDs must be unique, and schedule a daily check.
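
A conditional check applies an expectation only to the rows that satisfy some condition. Great Expectations exposes this idea through the `row_condition` and `condition_parser` arguments on many expectations; the sketch below does not use that API and instead shows the underlying logic in plain Python. The rule itself (customers aged 18 or over must have a well-formed email) and the function name `conditional_unexpected` are assumptions made for illustration.

```python
import re

EMAIL_PATTERN = re.compile(r".+@.+\..+")

# Same sample rows as Task 1, written row-wise.
rows = [
    {"CustomerID": 101, "Age": 25, "Email": "a@example.com"},
    {"CustomerID": 102, "Age": 30, "Email": "b@example.com"},
    {"CustomerID": 103, "Age": 17, "Email": "c@example.com"},
    {"CustomerID": 102, "Age": 45, "Email": "invalid-email"},
]


def conditional_unexpected(rows, condition, predicate):
    """Return indices of rows that satisfy `condition` but fail `predicate`.

    Rows that do not satisfy `condition` are skipped entirely.
    """
    return [
        i for i, row in enumerate(rows) if condition(row) and not predicate(row)
    ]


# Hypothetical rule: only adults are required to have a well-formed email.
adults_with_bad_email = conditional_unexpected(
    rows,
    condition=lambda r: r["Age"] >= 18,
    predicate=lambda r: EMAIL_PATTERN.search(r["Email"]) is not None,
)
print("Rows violating the conditional rule:", adults_with_bad_email)
```

The row with `Age` 17 is never tested against the email rule, which is what distinguishes a conditional expectation from an unconditional one.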

Scheduler started, press Ctrl+C to exit...


KeyboardInterrupt: 
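
The `KeyboardInterrupt` above is what an unguarded scheduler loop produces when it is stopped by hand. One way to run a daily check without a third-party scheduler, and to stop it cleanly, is sketched below using only the standard library. The names `seconds_until` and `run_daily` are hypothetical, and the `max_runs`, `sleep`, and `clock` parameters exist only so that the loop can be bounded and tested; this is one possible design, not the approach used by the `schedule` package.

```python
import time
from datetime import datetime, timedelta


def seconds_until(hour, minute, now):
    """Seconds from `now` until the next occurrence of hour:minute."""
    target = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if target <= now:
        # Today's slot has already passed (or is right now): use tomorrow's.
        target += timedelta(days=1)
    return (target - now).total_seconds()


def run_daily(job, hour=9, minute=0, max_runs=None,
              sleep=time.sleep, clock=datetime.now):
    """Call `job` once per day at hour:minute until interrupted.

    Returns the number of completed runs. `max_runs=None` loops until Ctrl+C.
    """
    runs = 0
    try:
        while max_runs is None or runs < max_runs:
            sleep(seconds_until(hour, minute, clock()))
            job()
            runs += 1
    except KeyboardInterrupt:
        # Ctrl+C ends the loop without a traceback.
        pass
    return runs


# Example (blocks until interrupted), assuming scheduled_validation is defined:
# run_daily(scheduled_validation, hour=9, minute=0)
```

Sleeping until the next due time avoids waking up every minute, at the cost of not noticing clock changes while asleep.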