## Automate Data Quality Checks with Great Expectations
**Introduction**: In this activity, you will learn how to automate data quality checks using the Great Expectations framework. This includes setting up expectations and generating validation reports.

### Task 1: Setup and Initial Expectations

1. Objective: Set up Great Expectations and create initial expectations for a dataset.
2. Steps:
    - Install Great Expectations using pip.
    - Initialize a data context.
    - Create basic expectations on a sample dataset.
    - E.g., implement a basic setup and expectations for column presence and type.

In [1]:
# Write your code from here
# Step 1: Install Great Expectations (run in a terminal or notebook cell)
# !pip install great_expectations

import great_expectations as gx
import pandas as pd

# NOTE: this cell targets the GX Core 1.x API. The legacy entry points
# (`DataContext.create`, `ge.from_pandas`) are not available there; importing
# `DataContext` from `great_expectations.data_context` raises ImportError.

# Step 2: Create a Data Context (in-memory here; use mode="file" with a
# project_root_dir to persist a project folder on disk instead)
context = gx.get_context(mode="ephemeral")
print("Great Expectations context created!")

# Step 3: Create a sample pandas DataFrame
data = {
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "Email": ["alice@example.com", "bob@example.com", "charlie@example.com"]
}
df = pd.DataFrame(data)

# Step 4: Register the DataFrame with Great Expectations and fetch it as a batch
data_source = context.data_sources.add_pandas("task1_pandas")
data_asset = data_source.add_dataframe_asset(name="people")
batch_definition = data_asset.add_batch_definition_whole_dataframe("all_rows")
batch = batch_definition.get_batch(batch_parameters={"dataframe": df})

# Step 5: Define expectations for column presence, type, and email format
suite = context.suites.add(gx.ExpectationSuite(name="basic_checks"))
suite.add_expectation(gx.expectations.ExpectColumnToExist(column="Name"))
suite.add_expectation(gx.expectations.ExpectColumnToExist(column="Age"))
suite.add_expectation(gx.expectations.ExpectColumnValuesToBeOfType(column="Age", type_="int64"))
suite.add_expectation(gx.expectations.ExpectColumnToExist(column="Email"))
suite.add_expectation(gx.expectations.ExpectColumnValuesToMatchRegex(column="Email", regex=r".+@.+\..+"))

# Step 6: Run validation and print results
results = batch.validate(suite)
print(results)
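
Great Expectations has changed its public API substantially between releases, so it can help to confirm which version is installed before running the cells. The following is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical and not part of Great Expectations.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string for `package`, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("great_expectations")
if version is None:
    print("great_expectations is not installed")
else:
    print(f"great_expectations {version} is installed")
```

Comparing the reported version against the release notes shows which style of API (legacy or current) the installed package supports.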

### Task 2: Validate Datasets and Generate Reports

1. Objective: Validate a dataset against defined expectations and generate a report.
2. Steps:
    - Execute the validation process on the dataset.
    - Review the validation results and generate a report.
    - E.g., validate completeness and consistency expectations, and view the results.


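In [2]:
# Write your code from here
import json

import great_expectations as gx
import pandas as pd

# NOTE: this cell targets the GX Core 1.x API; the legacy `ge.from_pandas`
# helper raises AttributeError there.

# Sample data with deliberate gaps and one malformed email address
data = {
    "Name": ["Alice", "Bob", None, "David"],
    "Age": [25, 30, 35, None],
    "Email": ["alice@example.com", "bob@example", "charlie@example.com", None]
}
df = pd.DataFrame(data)

# Register the DataFrame with Great Expectations and fetch it as a batch
context = gx.get_context(mode="ephemeral")
data_source = context.data_sources.add_pandas("task2_pandas")
data_asset = data_source.add_dataframe_asset(name="people_with_gaps")
batch_definition = data_asset.add_batch_definition_whole_dataframe("all_rows")
batch = batch_definition.get_batch(batch_parameters={"dataframe": df})

# Define expectations
suite = context.suites.add(gx.ExpectationSuite(name="completeness_and_consistency"))

# Completeness: no nulls in 'Name' and 'Age'
suite.add_expectation(gx.expectations.ExpectColumnValuesToNotBeNull(column="Name"))
suite.add_expectation(gx.expectations.ExpectColumnValuesToNotBeNull(column="Age"))

# Consistency: Email format
suite.add_expectation(gx.expectations.ExpectColumnValuesToMatchRegex(column="Email", regex=r".+@.+\..+"))

# Run validation
results = batch.validate(suite)

# Print validation results summary
print("Validation Results Summary:")
print(results.statistics)

# Optionally save full results to a JSON file. The result object is not
# directly JSON-serializable, so convert it to a plain dict first.
with open("validation_report.json", "w") as f:
    json.dump(results.to_json_dict(), f, indent=4)

print("\nValidation report saved to 'validation_report.json'")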
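
To make concrete what the completeness and consistency expectations check, the same three rules can be sketched in plain Python over the sample data. This is an illustration only, not how Great Expectations evaluates them internally. The helpers `non_null_fraction` and `regex_match_fraction` are hypothetical names, and the sketch assumes search-style regex matching with missing values skipped in the format check (they are left to the null checks).

```python
import re

# Same sample data as in the Task 2 cell
data = {
    "Name": ["Alice", "Bob", None, "David"],
    "Age": [25, 30, 35, None],
    "Email": ["alice@example.com", "bob@example", "charlie@example.com", None],
}

EMAIL_PATTERN = re.compile(r".+@.+\..+")


def non_null_fraction(values):
    """Completeness: share of entries that are not missing."""
    return sum(v is not None for v in values) / len(values)


def regex_match_fraction(values, pattern):
    """Consistency: share of non-missing entries that match the pattern."""
    present = [v for v in values if v is not None]
    if not present:
        return 1.0  # nothing to check
    return sum(bool(pattern.search(v)) for v in present) / len(present)


checks = {
    "Name not null": non_null_fraction(data["Name"]),
    "Age not null": non_null_fraction(data["Age"]),
    "Email matches regex": regex_match_fraction(data["Email"], EMAIL_PATTERN),
}

for label, fraction in checks.items():
    status = "PASS" if fraction == 1.0 else "FAIL"
    print(f"{label}: {fraction:.0%} of values OK -> {status}")
```

On this sample every rule fails, because each column carries a deliberate defect: one missing name, one missing age, and the malformed address `bob@example`.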

### Task 3: Advanced Expectations and Scheduling

1. Objective: Create advanced expectations for conditional checks and automate the validation.
2. Steps:
    - Define advanced expectations based on complex conditions.
    - Use scheduling tools to automate periodic checks.
    - E.g., define an expectation that customer IDs must be unique, and schedule a daily check.
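
A conditional check applies a rule only to the rows that satisfy some condition. Below is a minimal plain-Python sketch of that idea, alongside the uniqueness rule from the example. The records, field names, and helper names are hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical customer records, for illustration only
customers = [
    {"CustomerID": 101, "Age": 34, "Email": "a@example.com"},
    {"CustomerID": 102, "Age": 17, "Email": None},
    {"CustomerID": 102, "Age": 45, "Email": "b@example.com"},
    {"CustomerID": 103, "Age": 52, "Email": None},
]


def duplicated_values(rows, column):
    """Uniqueness: return the values that occur more than once in `column`."""
    counts = Counter(row[column] for row in rows)
    return sorted(value for value, n in counts.items() if n > 1)


def conditional_failures(rows, condition, rule):
    """Conditional check: rows that meet `condition` but break `rule`."""
    return [row for row in rows if condition(row) and not rule(row)]


duplicates = duplicated_values(customers, "CustomerID")

# Rule: an email is required, but only for customers aged 18 or over
adults_without_email = conditional_failures(
    customers,
    condition=lambda row: row["Age"] >= 18,
    rule=lambda row: row["Email"] is not None,
)

print("Duplicate customer IDs:", duplicates)
print("Adults missing an email:", [row["CustomerID"] for row in adults_without_email])
```

In Great Expectations itself, conditions of this kind are typically expressed through an expectation's `row_condition` argument; the exact syntax depends on the installed version, so check its documentation.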

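In [4]:
import time

import great_expectations as gx
import pandas as pd
import schedule  # third-party package: pip install schedule

def run_validation():
    # Load the data to check. This placeholder frame stands in for your own
    # source (e.g. pd.read_csv(...)); it includes the CustomerID column the
    # uniqueness check needs.
    df = pd.DataFrame({
        "CustomerID": [101, 102, 103],
        "Age": [25, 30, 35],
    })

    # Register the DataFrame with Great Expectations (GX Core 1.x API)
    context = gx.get_context(mode="ephemeral")
    data_source = context.data_sources.add_pandas("scheduled_pandas")
    data_asset = data_source.add_dataframe_asset(name="customers")
    batch_definition = data_asset.add_batch_definition_whole_dataframe("all_rows")
    batch = batch_definition.get_batch(batch_parameters={"dataframe": df})

    # Define expectations: unique customer IDs and a minimum age of 18
    suite = context.suites.add(gx.ExpectationSuite(name="customer_checks"))
    suite.add_expectation(gx.expectations.ExpectColumnValuesToBeUnique(column="CustomerID"))
    suite.add_expectation(gx.expectations.ExpectColumnValuesToBeBetween(column="Age", min_value=18))

    results = batch.validate(suite)
    print("Scheduled Validation Run:")
    print(results.statistics)

# Schedule the job to run daily at 9:00 AM
schedule.every().day.at("09:00").do(run_validation)

print("Scheduler started, press Ctrl+C to exit...")
while True:
    schedule.run_pending()
    time.sleep(60)  # wait 1 minute between checks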


Scheduler started, press Ctrl+C to exit...


KeyboardInterrupt: 
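
The polling loop blocks the notebook until it is interrupted, which is why the run ends in `KeyboardInterrupt`. When experimenting, it can help to see how long the wait until the next daily run actually is. The following is a minimal standard-library sketch; `seconds_until` is a hypothetical helper, not part of the `schedule` package.

```python
from datetime import datetime, timedelta


def seconds_until(hour: int, minute: int, now: datetime) -> float:
    """Seconds from `now` until the next occurrence of hour:minute.

    Works on naive local time, so daylight-saving changes are ignored.
    """
    target = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if target <= now:
        target += timedelta(days=1)  # today's slot has passed; use tomorrow's
    return (target - now).total_seconds()


wait = seconds_until(9, 0, datetime.now())
print(f"Next 09:00 run is in {wait / 3600:.1f} hours")
```

A long-running `while True` loop is fine for a demonstration. For unattended daily checks, an operating-system scheduler such as cron is a common alternative to keeping a notebook kernel alive.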