## Automate Data Quality Checks with Great Expectations
**Introduction**: In this activity, you will learn how to automate data quality checks using the Great Expectations framework. This includes setting up expectations and generating validation reports.

### Task 1: Setup and Initial Expectations

1. Objective: Set up Great Expectations and create initial expectations for a dataset.
2. Steps:
    - Install Great Expectations using pip.
    - Initialize a data context.
    - Create basic expectations on a sample dataset.
    - E.g., implement a basic setup and an expectation for column presence and type.
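
Before wiring this into Great Expectations, it can help to see what a column-presence and a column-type expectation amount to. The following is a minimal plain-Python sketch, not the Great Expectations API; the helper names `column_exists` and `column_values_of_type` are hypothetical and the rows are made-up sample data.

```python
# Plain-Python illustration only; Great Expectations is not used here.
# The helper names below are hypothetical.

def column_exists(rows, column):
    """Return True if every row (a dict) contains the given column."""
    return all(column in row for row in rows)


def column_values_of_type(rows, column, expected_type):
    """Return True if every value in the column is an instance of expected_type."""
    return all(isinstance(row[column], expected_type) for row in rows)


# Made-up sample rows for the illustration
rows = [
    {"CustomerID": 1, "Name": "John"},
    {"CustomerID": 2, "Name": "Jane"},
]

presence_ok = column_exists(rows, "CustomerID")           # column is present
type_ok = column_values_of_type(rows, "CustomerID", int)  # all values are ints
missing_ok = column_exists(rows, "Email")                 # column is absent
print(presence_ok, type_ok, missing_ok)
```

An expectation in Great Expectations plays the same role as these helpers: a named, declarative check that either passes or fails for a given batch of data.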

In [3]:
# Write your code from here

### Task 2: Validate Datasets and Generate Reports

1. Objective: Validate a dataset against defined expectations and generate a report.
2. Steps:
    - Execute the validation process on the dataset.
    - Review the validation results and generate a report.
    - E.g., validate completeness and consistency expectations, and view the results.
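
As a rough picture of what "validate and report" means, the sketch below runs a completeness check (no null IDs) and two consistency checks (unique IDs, email pattern) and summarises the outcome. It is plain Python under stated assumptions, not the Great Expectations report format: the function names `validate` and `format_report` are hypothetical, the rows are made-up sample data, and the email pattern is applied with `fullmatch`.

```python
import re

# Simple email pattern: something@something.something
EMAIL_PATTERN = re.compile(r"[^@]+@[^@]+\.[^@]+")


def validate(rows):
    """Run each check and return a list of per-check result dicts."""
    ids = [row.get("CustomerID") for row in rows]
    emails = [row.get("Email") for row in rows]
    checks = {
        "CustomerID is never null": all(v is not None for v in ids),
        "CustomerID is unique": len(set(ids)) == len(ids),
        "Email matches pattern": all(
            isinstance(e, str) and EMAIL_PATTERN.fullmatch(e) is not None
            for e in emails
        ),
    }
    return [{"check": name, "success": ok} for name, ok in checks.items()]


def format_report(results):
    """Turn the per-check results into a short text report."""
    lines = [f"{'PASS' if r['success'] else 'FAIL'}  {r['check']}" for r in results]
    passed = sum(r["success"] for r in results)
    lines.append(f"{passed}/{len(results)} checks passed")
    return "\n".join(lines)


# Made-up rows with a duplicate ID and a malformed email
rows = [
    {"CustomerID": 1, "Email": "john@example.com"},
    {"CustomerID": 2, "Email": "jane@x.com"},
    {"CustomerID": 2, "Email": "not-an-email"},
]
results = validate(rows)
report = format_report(results)
print(report)
```

A Great Expectations validation result carries the same kind of information (one success flag per expectation plus an overall summary), with additional detail such as counts of unexpected values.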


In [4]:
# Write your code from here

### Task 3: Advanced Expectations and Scheduling

1. Objective: Create advanced expectations for conditional checks and automate the validation.
2. Steps:
    - Define advanced expectations based on complex conditions.
    - Use scheduling tools to automate periodic checks.
    - E.g., define an expectation that customer IDs must be unique, and schedule a daily check.
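
The two halves of this task can be sketched in plain Python: a uniqueness check on customer IDs, and the arithmetic a scheduler performs to decide when the next daily run is due. This is an illustration under assumptions, not a Great Expectations feature; the names `ids_are_unique` and `next_daily_run` and the 02:00 run time are hypothetical.

```python
from datetime import datetime, timedelta


def ids_are_unique(rows):
    """Return True if no CustomerID value appears more than once."""
    ids = [row["CustomerID"] for row in rows]
    return len(ids) == len(set(ids))


def next_daily_run(now, hour=2, minute=0):
    """Return the next datetime at hour:minute that is strictly after `now`."""
    candidate = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if candidate <= now:
        # Today's slot has already passed, so schedule for tomorrow
        candidate += timedelta(days=1)
    return candidate


# Made-up rows containing a duplicate ID
rows = [{"CustomerID": 1}, {"CustomerID": 2}, {"CustomerID": 2}]
unique_ok = ids_are_unique(rows)

# Fixed timestamp so the example is reproducible
now = datetime(2024, 1, 15, 9, 30)
run_at = next_daily_run(now)
print(unique_ok, run_at)
```

In practice the scheduling is usually delegated to an external tool such as cron or a workflow orchestrator, which would invoke a validation script once a day; the function above only shows the timing logic such a tool applies.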

In [5]:
! pip install great_expectations --quiet


In [6]:
import pandas as pd

# Sample data
data = {
    "CustomerID": [1, 2, 3, 4, 5],
    "Name": ["John", "Jane", "Tom", "Alice", "Bob"],
    "Email": ["john@example.com", "jane@x.com", "tom@z.com", "alice@y.com", "bob@a.com"]
}
df = pd.DataFrame(data)
df.to_csv("customer_data.csv", index=False)
df.head()


,CustomerID,Name,Email
0,1,John,john@example.com
1,2,Jane,jane@x.com
2,3,Tom,tom@z.com
3,4,Alice,alice@y.com
4,5,Bob,bob@a.com


In [7]:
import great_expectations as gx
import os

# Initialize the context
context_root_dir = "great_expectations"
context = gx.get_context(context_root_dir=context_root_dir)


In [8]:
# Define a datasource for CSV via Pandas
# (Great Expectations 1.x exposes data sources under `context.data_sources`;
# `context.sources` was the 0.x name and raises AttributeError on 1.x.)
datasource = context.data_sources.add_pandas_filesystem(
    name="customer_datasource",
    base_directory=os.getcwd(),  # current directory
)

datasource


In [9]:
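# Create an expectation suite
# (In Great Expectations 1.x, suites are managed through `context.suites`;
# `context.add_or_update_expectation_suite` was removed. On early 1.0 releases
# without `add_or_update`, `context.suites.add(...)` is the equivalent call.)
suite_name = "customer_suite"
suite = context.suites.add_or_update(gx.ExpectationSuite(name=suite_name))
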

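In [None]:
# Register the CSV as an asset and define a batch covering the whole file.
# (Great Expectations 1.x fluent API; the legacy `context.get_batch` call with
# data connectors is not available in 1.x.)
asset = context.data_sources.get("customer_datasource").add_csv_asset(name="customer_data")
batch_definition = asset.add_batch_definition_path(
    name="customer_data_csv",
    path="customer_data.csv",
)

# Expectations are added to the suite, which persists them in the context
suite = context.suites.get("customer_suite")
suite.add_expectation(gx.expectations.ExpectColumnToExist(column="CustomerID"))
suite.add_expectation(gx.expectations.ExpectColumnValuesToBeUnique(column="CustomerID"))
suite.add_expectation(gx.expectations.ExpectColumnValuesToNotBeNull(column="CustomerID"))

suite.add_expectation(gx.expectations.ExpectColumnToExist(column="Email"))
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToMatchRegex(column="Email", regex=r"[^@]+@[^@]+\.[^@]+")
)


In [None]:
# Pair the batch definition with the suite, then run both through a checkpoint
validation_definition = context.validation_definitions.add(
    gx.ValidationDefinition(name="customer_validation", data=batch_definition, suite=suite)
)
checkpoint = context.checkpoints.add(
    gx.Checkpoint(
        name="customer_checkpoint",
        validation_definitions=[validation_definition],
        # Rebuild the HTML Data Docs after each run
        actions=[gx.checkpoint.UpdateDataDocsAction(name="update_data_docs")],
    )
)

results = checkpoint.run()
print(results.success)

# Open the generated validation report in a browser
context.open_data_docs()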
