## Automate Data Quality Checks with Great Expectations
**Introduction**: In this activity, you will learn how to automate data quality checks using the Great Expectations framework. This includes setting up expectations and generating validation reports.

### Task 1: Setup and Initial Expectations

1. Objective: Set up Great Expectations and create initial expectations for a dataset.
2. Steps:
    - Install Great Expectations using pip.
    - Initialize a data context.
    - Create basic expectations on a sample dataset.
    - Eg., Implement a basic setup and expectation for column presence and type.

In [None]:
# Write your code from here

Working directory: /workspaces/AI_DATA_ANALYSIS_/src/Module 8/Hands-on - Data Quality Scoring & Automation/my_ge_project
Great Expectations project will be created at: /workspaces/AI_DATA_ANALYSIS_/src/Module 8/Hands-on - Data Quality Scoring & Automation/my_ge_project/my_ge_project_jupyter
Creating project directory: /workspaces/AI_DATA_ANALYSIS_/src/Module 8/Hands-on - Data Quality Scoring & Automation/my_ge_project/my_ge_project_jupyter

--- 1. Create Sample Dataset ---
Sample data created at: /workspaces/AI_DATA_ANALYSIS_/src/Module 8/Hands-on - Data Quality Scoring & Automation/my_ge_project/my_ge_project_jupyter/data/sample_data.csv
Sample Data Preview:
   id     name  age           city
0   1    Alice   30       New York
1   2      Bob   24    Los Angeles
2   3  Charlie   35        Chicago
3   4    David   29       New York
4   5      Eve   42  San Francisco

--- 2. Initialize Great Expectations Data Context (Interactive Step) ---
Please run the following command in a new termin

### Task 2: Validate Datasets and Generate Reports

1. Objective: Validate a dataset against defined expectations and generate a report.
2. Steps:
    - Execute the validation process on the dataset.
    - Review the validation results and generate a report.
    - Eg., Validate completeness and consistency expectations, and view the results.


In [3]:
import os
import pandas as pd
from great_expectations.data_context import DataContext
from great_expectations.core.batch import BatchRequest
from great_expectations.checkpoint import Checkpoint

# --- Configuration (ensure these match your Task 1 setup) ---
PROJECT_DIR_NAME = "my_ge_project_jupyter"
GE_PROJECT_PATH = os.path.join(os.getcwd(), PROJECT_DIR_NAME) # Assuming you are in the notebook's original root
DATA_DIR_NAME = "data"
SAMPLE_DATA_FILE_PATH = os.path.join(GE_PROJECT_PATH, DATA_DIR_NAME, "sample_data.csv")
EXPECTATION_SUITE_NAME = "my_sample_data_expectations"

# --- Ensure we are in the correct directory for DataContext to load ---
# This is crucial if you ran previous cells and moved directories or restarted kernel
if os.getcwd() != GE_PROJECT_PATH:
    print(f"Changing current working directory to: {GE_PROJECT_PATH}")
    os.chdir(GE_PROJECT_PATH)
else:
    print(f"Already in the correct directory: {os.getcwd()}")


print("\n--- Task 2: Validate Datasets and Generate Reports ---")

print("\n--- 1. Objective: Validate a dataset against defined expectations and generate a report. ---")

# Load the DataContext. This assumes `great_expectations init` was run and the GE directory exists.
context = DataContext()
print("Great Expectations DataContext loaded.")

# Define the batch request to identify the data to be validated
batch_request = BatchRequest(
    datasource_name="my_sample_data_source",
    data_connector_name="default_inferred_data_connector",
    data_asset_name="sample_data",
    # Path is relative to the base_directory configured for the data connector
    batch_spec_passthrough={"path": "sample_data.csv"}
)

print(f"\nAttempting to validate data asset: {batch_request.data_asset_name}")
print(f"Using expectation suite: {EXPECTATION_SUITE_NAME}")

# --- 2. Execute the validation process on the dataset. ---

# Great Expectations often uses Checkpoints for validation workflows.
# Let's create a simple in-memory Checkpoint for this validation.
# In a production setup, you would define Checkpoints in great_expectations/checkpoints/

checkpoint_config = {
    "name": "my_sample_checkpoint",
    "config_version": 1,
    "class_name": "SimpleCheckpoint",
    "run_name_template": "%Y%m%d-%H%M%S-validation", # A template for naming validation runs
    "validations": [
        {
            "batch_request": batch_request,
            "expectation_suite_name": EXPECTATION_SUITE_NAME,
        }
    ]
}

# Add the checkpoint to the context (it will be saved to your great_expectations/checkpoints directory)
context.add_checkpoint(**checkpoint_config)
print("Checkpoint 'my_sample_checkpoint' added to DataContext.")

# Get the checkpoint and run the validation
checkpoint = context.get_checkpoint(name="my_sample_checkpoint")
checkpoint_result = checkpoint.run()

print("\n--- Validation Execution Complete ---")
print(f"Validation successful: {checkpoint_result.success}")
print(f"Number of expectations run: {checkpoint_result.statistics['evaluated_expectations']}")
print(f"Number of expectations passed: {checkpoint_result.statistics['successful_expectations']}")
print(f"Number of expectations failed: {checkpoint_result.statistics['unsuccessful_expectations']}")


# --- 3. Review the validation results and generate a report. ---
# The build_data_docs() method automatically generates HTML reports from validation results.
# It picks up the results from your latest validation runs.

print("\n--- Generating Data Docs to review results ---")
context.build_data_docs()
print("Data Docs built successfully.")

# Get the path to the latest data docs index.html
data_docs_path = os.path.abspath(os.path.join('great_expectations', 'uncommitted', 'data_docs', 'index.html'))
print(f"To view the validation report, open this file in your web browser: {data_docs_path}")

# Optional: Add more specific expectations to demonstrate validation failures
# For example, let's create a slightly modified dataset that might fail some checks
print("\n--- Optional: Demonstrating Validation Failures ---")
print("Creating a 'bad_sample_data.csv' to show failed expectations.")
bad_data = {
    'id': [1, 'two', 3, 4, 5], # 'two' is wrong type
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'age': [30, 24, 35, None, 42], # None is missing value (will be float)
    'city': ['New York', 'Los Angeles', 'Chicago', 'New York', 'San Francisco', 'Extra City'] # Extra column for match_set failure (if schema check were stricter)
}
bad_df = pd.DataFrame(bad_data)
bad_sample_data_file_path = os.path.join(GE_PROJECT_PATH, DATA_DIR_NAME, "bad_sample_data.csv")
bad_df.to_csv(bad_sample_data_file_path, index=False)
print(f"Bad sample data created at: {bad_sample_data_file_path}")

# Define a new batch request for the bad data
bad_batch_request = BatchRequest(
    datasource_name="my_sample_data_source",
    data_connector_name="default_inferred_data_connector",
    data_asset_name="bad_sample_data", # New data asset name
    batch_spec_passthrough={"path": "bad_sample_data.csv"}
)

# Create a new checkpoint for the bad data validation
bad_checkpoint_config = {
    "name": "my_bad_data_checkpoint",
    "config_version": 1,
    "class_name": "SimpleCheckpoint",
    "run_name_template": "%Y%m%d-%H%M%S-bad-validation",
    "validations": [
        {
            "batch_request": bad_batch_request,
            "expectation_suite_name": EXPECTATION_SUITE_NAME, # Using the same suite
        }
    ]
}

context.add_checkpoint(**bad_checkpoint_config)
print("Checkpoint 'my_bad_data_checkpoint' added for bad data validation.")

bad_checkpoint = context.get_checkpoint(name="my_bad_data_checkpoint")
bad_checkpoint_result = bad_checkpoint.run()

print("\n--- Bad Data Validation Execution Complete ---")
print(f"Validation successful: {bad_checkpoint_result.success}") # Should be False
print(f"Number of expectations run: {bad_checkpoint_result.statistics['evaluated_expectations']}")
print(f"Number of expectations passed: {bad_checkpoint_result.statistics['successful_expectations']}")
print(f"Number of expectations failed: {bad_checkpoint_result.statistics['unsuccessful_expectations']}")

print("\n--- Re-generating Data Docs to see new validation results (including failures) ---")
context.build_data_docs()
print("Data Docs re-built successfully. Check the report for failed expectations on 'bad_sample_data'.")
print(f"Re-open this path to view updated Data Docs: {data_docs_path}")

print("\n--- Task 2 Complete! ---")

ImportError: cannot import name 'DataContext' from 'great_expectations.data_context' (/home/vscode/.local/lib/python3.10/site-packages/great_expectations/data_context/__init__.py)

### Task 3: Advanced Expectations and Scheduling

1. Objective: Create advanced expectations for conditional checks and automate the validation.
2. Steps:
    - Define advanced expectations based on complex conditions.
    - Use scheduling tools to automate periodic checks.
    - E.g., an expectation that customer IDs must be unique and schedule a daily check.

In [None]:
# Write your code from here