## Automate Data Quality Checks with Great Expectations
**Introduction**: In this activity, you will learn how to automate data quality checks using the Great Expectations framework. This includes setting up expectations and generating validation reports.

### Task 1: Setup and Initial Expectations

1. Objective: Set up Great Expectations and create initial expectations for a dataset.
2. Steps:
    - Install Great Expectations using pip.
    - Initialize a data context.
    - Create basic expectations on a sample dataset.
    - E.g., implement a basic setup and add expectations for column presence and type.

In [1]:
# Write your code from here
# 1. Install Great Expectations (run once in a shell): pip install great_expectations
import great_expectations as gx
import pandas as pd
import numpy as np

# 2. Initialize a Data Context
context = gx.get_context()

# 3. Create a sample dataset (or load your own)
data = pd.DataFrame({
    'customer_id': np.arange(1000, 1020),
    'name': ['John', 'Jane', 'Mike', 'Sarah', 'Alex', 'Emily', 'David', 'Lisa', 'Tom', 'Anna',
             'Chris', 'Emma', 'Ryan', 'Olivia', 'Daniel', 'Sophia', 'Matthew', 'Ava', 'Andrew', 'Mia'],
    'email': [f"user{i}@domain.com" for i in range(20)],
    'age': [25, 32, 45, 28, 30, 22, 38, 41, 29, 35, 
            31, 27, 33, 26, 40, 24, 36, 29, 34, 28],
    'join_date': pd.date_range('2023-01-01', periods=20),
    'purchase_amount': [round(np.random.uniform(10, 100), 2) for _ in range(20)],
    'is_active': [True, True, False, True, False, True, True, False, True, True,
                  False, True, False, True, True, False, True, False, True, True]
})

# 4. Register the DataFrame as a batch
# Note: gx.from_pandas is not available in Great Expectations 1.x, so this cell
# assumes the 1.x API. The source, asset, and batch definition names are arbitrary.
data_source = context.data_sources.add_pandas(name="pandas_source")
data_asset = data_source.add_dataframe_asset(name="customer_data")
batch_definition = data_asset.add_batch_definition_whole_dataframe("customer_batch")
batch = batch_definition.get_batch(batch_parameters={"dataframe": data})

# 5. Create an Expectation Suite
expectation_suite_name = "customer_data_quality"
suite = context.suites.add(gx.ExpectationSuite(name=expectation_suite_name))

# 6. Add Expectations (Data Quality Rules)
gxe = gx.expectations
expectations = [
    gxe.ExpectColumnToExist(column="customer_id"),
    gxe.ExpectColumnValuesToBeOfType(column="customer_id", type_="int64"),
    gxe.ExpectColumnValuesToBeUnique(column="customer_id"),

    gxe.ExpectColumnToExist(column="name"),
    gxe.ExpectColumnValuesToNotBeNull(column="name"),

    gxe.ExpectColumnToExist(column="email"),
    gxe.ExpectColumnValuesToMatchRegex(
        column="email",
        regex=r"^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$"
    ),

    gxe.ExpectColumnToExist(column="age"),
    gxe.ExpectColumnValuesToBeBetween(column="age", min_value=18, max_value=100),

    gxe.ExpectColumnToExist(column="join_date"),
    gxe.ExpectColumnValuesToNotBeNull(column="join_date"),

    gxe.ExpectColumnToExist(column="purchase_amount"),
    gxe.ExpectColumnValuesToBeBetween(
        column="purchase_amount",
        min_value=0,
        max_value=1000
    ),

    gxe.ExpectColumnToExist(column="is_active"),
    gxe.ExpectColumnValuesToBeOfType(column="is_active", type_="bool"),
]
for expectation in expectations:
    suite.add_expectation(expectation)

# 7. Save the Expectation Suite
suite.save()

# 8. Validate the data against the suite
validation_result = batch.validate(suite)

# 9. View the validation results
print("\nValidation Results Summary:")
print(f"Successful expectations: {validation_result.statistics['successful_expectations']}")
print(f"Failed expectations: {validation_result.statistics['unsuccessful_expectations']}")
print(f"Success %: {validation_result.statistics['success_percent']:.2f}%")

# 10. Generate HTML report (Data Docs)
# Depending on the context configuration, results from batch.validate() may not
# appear in Data Docs unless the validation is run through a Checkpoint.
context.build_data_docs()
context.open_data_docs()

### Task 2: Validate Datasets and Generate Reports

1. Objective: Validate a dataset against defined expectations and generate a report.
2. Steps:
    - Execute the validation process on the dataset.
    - Review the validation results and generate a report.
    - E.g., validate completeness and consistency expectations, and view the results.


In [2]:
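# Write your code from here
from pathlib import Path

import pandas as pd

# Load dataset.csv when it exists; otherwise reuse the sample DataFrame `data` from Task 1
csv_path = Path('dataset.csv')
df = pd.read_csv(csv_path) if csv_path.exists() else data.copy()

# Completeness: percentage of non-null values per column
completeness = df.notnull().mean() * 100

# Consistency: percentage of rows with a non-negative age (matches 'Age' or 'age')
consistency = {}
age_col = next((c for c in df.columns if str(c).lower() == 'age'), None)
if age_col is not None:
    consistency[f'{age_col}_non_negative'] = (df[age_col] >= 0).mean() * 100

# Combine both sets of metrics into one report
report = pd.DataFrame({
    'Metric': [f'{col}_completeness' for col in completeness.index] + list(consistency.keys()),
    'Value (%)': list(completeness.values) + list(consistency.values())
})

print(report)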
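
Reviewing the results is easier when each metric carries an explicit pass/fail flag. The following is a minimal pandas sketch of that idea, not part of Great Expectations: the helper name `completeness_report`, the `min_percent` threshold, and the four-row sample frame are all assumptions made for illustration.

```python
import pandas as pd

def completeness_report(df, min_percent=95.0):
    """Return per-column completeness (% non-null) with a pass/fail flag."""
    pct = df.notnull().mean() * 100
    return pd.DataFrame({
        "column": pct.index,
        "completeness_pct": pct.values,
        "passed": (pct >= min_percent).values,
    })

# Illustrative sample only: one missing name out of four rows
sample = pd.DataFrame({
    "name": ["Ann", None, "Raj", "Li"],
    "age": [31, 27, 45, 38],
})

threshold_report = completeness_report(sample, min_percent=90.0)
print(threshold_report)
```

Here `name` is 75% complete and fails the 90% threshold, while `age` is fully populated and passes.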

### Task 3: Advanced Expectations and Scheduling

1. Objective: Create advanced expectations for conditional checks and automate the validation.
2. Steps:
    - Define advanced expectations based on complex conditions.
    - Use scheduling tools to automate periodic checks.
    - E.g., define an expectation that customer IDs must be unique, and schedule a daily check.
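
A conditional check applies a rule only to the rows that satisfy some condition. The sketch below shows the idea in plain pandas, under the assumption that active customers should have a positive purchase amount. The helper `check_conditional` and the sample rows are hypothetical; the column names are borrowed from the Task 1 sample data. Great Expectations offers a comparable mechanism through a `row_condition` argument, whose exact syntax depends on the installed version.

```python
import pandas as pd

def check_conditional(df, condition, rule):
    """Apply `rule` only to rows where `condition` holds.

    Returns (success, failing_row_count, evaluated_row_count).
    """
    subset = df[condition(df)]          # rows the rule applies to
    passed = rule(subset)               # boolean Series, one entry per row
    failures = int((~passed).sum())     # rows that break the rule
    return failures == 0, failures, len(subset)

# Illustrative sample only
sample = pd.DataFrame({
    "is_active": [True, True, False, True],
    "purchase_amount": [20.0, 0.0, 0.0, 55.5],
})

ok, failures, evaluated = check_conditional(
    sample,
    condition=lambda d: d["is_active"],
    rule=lambda d: d["purchase_amount"] > 0,
)
print(f"success={ok}, failures={failures}, evaluated={evaluated}")
```

The inactive row is skipped entirely, so only three rows are evaluated and the one active customer with a zero purchase amount is reported as a failure.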

In [None]:
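# Write your code from here
# Requires the third-party `schedule` package: pip install schedule
from pathlib import Path
import time

import pandas as pd
import schedule

DATA_PATH = Path('dataset.csv')  # replace with the path to your own dataset

def validate_dataset():
    """Load the dataset and print a small report of expectation results."""
    if not DATA_PATH.exists():
        print(f"Skipping validation: {DATA_PATH} not found")
        return

    df = pd.read_csv(DATA_PATH)

    expectations = {}

    # Customer IDs must be unique
    if 'CustomerID' in df.columns:
        expectations['CustomerID_uniqueness'] = df['CustomerID'].is_unique

    # Percentage of emails containing '@'; missing values count as invalid
    if 'Email' in df.columns:
        valid = df['Email'].str.contains('@', regex=False, na=False)
        expectations['Valid_Emails'] = valid.mean() * 100

    report = pd.DataFrame({
        'Expectation': list(expectations.keys()),
        'Result': list(expectations.values())
    })

    print(report)

# Run the check every day at 10:00 local time
schedule.every().day.at("10:00").do(validate_dataset)

# This loop blocks the notebook cell; interrupt the kernel to stop it
try:
    while True:
        schedule.run_pending()
        time.sleep(60)
except KeyboardInterrupt:
    print("Scheduler stopped")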
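
The polling loop above keeps the kernel busy until it is interrupted. As a rough illustration of what a daily trigger has to compute, the standard-library sketch below works out how long to wait until the next run at a given time of day. The helper name `seconds_until` is hypothetical and is not part of the `schedule` package.

```python
from datetime import datetime, timedelta

def seconds_until(hour, minute, now=None):
    """Seconds from `now` until the next occurrence of hour:minute."""
    now = now or datetime.now()
    target = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if target <= now:
        # The time has already passed today, so aim for tomorrow
        target += timedelta(days=1)
    return (target - now).total_seconds()

# Fixed timestamps keep the example deterministic
wait_morning = seconds_until(10, 0, now=datetime(2024, 1, 1, 9, 30))
wait_after = seconds_until(10, 0, now=datetime(2024, 1, 1, 11, 0))
print(wait_morning)  # 1800.0
print(wait_after)    # 82800.0
```

For long-running deployments, a cron job or a workflow orchestrator is a common alternative to an in-process loop, since the check then survives notebook restarts.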