## Automated Data Quality Monitoring
**Objective**: Use Great Expectations to perform data profiling and write validation rules.

1. Data Profiling with Great Expectations

### Profile a JSON dataset with product sales data to check for null values in the 'ProductID' and 'Price' fields.
- Create an expectation suite and connect it to the data context.
- Use the `expect_column_values_to_not_be_null` expectation to profile these fields.
- Review the summary to identify any unexpected null values.
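
To make the summary in the last step concrete, here is a minimal plain-Python sketch of the numbers a not-null check reports for one column. It does not call Great Expectations. The helper name `null_summary` and the sample rows are illustrative assumptions, and the keys only mirror the `unexpected_count` and `unexpected_percent` fields found in Great Expectations validation results.

```python
from typing import Any


def null_summary(rows: list[dict[str, Any]], column: str) -> dict[str, Any]:
    """Count missing values in `column` across a list of record dicts."""
    element_count = len(rows)
    # A value counts as null when the key is absent or explicitly None.
    unexpected_count = sum(1 for row in rows if row.get(column) is None)
    unexpected_percent = (
        100.0 * unexpected_count / element_count if element_count else 0.0
    )
    return {
        "column": column,
        "element_count": element_count,
        "unexpected_count": unexpected_count,
        "unexpected_percent": unexpected_percent,
        "success": unexpected_count == 0,
    }


# Illustrative records (hypothetical data, not from a real source).
sample_rows = [
    {"ProductID": 101, "Price": 19.99},
    {"ProductID": None, "Price": 15.99},
    {"ProductID": 104, "Price": None},
    {"ProductID": 105, "Price": 9.99},
]

for name in ("ProductID", "Price"):
    print(null_summary(sample_rows, name))
```

A column passes only when its `unexpected_count` is zero, which is the condition to look for when reviewing the real validation summary.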

In [2]:
# write your code from here
# Step 0: Install Great Expectations if not installed.
# This cell assumes a pre-1.0 release, where ge.from_pandas is still available.
# !pip install "great_expectations<1.0"

import json

import great_expectations as ge

# Step 1: Create a sample JSON dataset manually
json_data = [
    {"ProductID": 101, "Price": 19.99, "ProductName": "Widget A"},
    {"ProductID": 102, "Price": 29.99, "ProductName": "Widget B"},
    {"ProductID": None, "Price": 15.99, "ProductName": "Widget C"},  # Null ProductID
    {"ProductID": 104, "Price": None, "ProductName": "Widget D"},   # Null Price
]

json_file = "product_sales.json"
with open(json_file, "w") as f:
    json.dump(json_data, f, indent=2)

print(f"Created JSON file: {json_file}")

# Step 2: Initialize the Great Expectations Data Context.
# ge.get_context() is used because the DataContext class is no longer
# importable from great_expectations.data_context in recent releases.
context = ge.get_context()

# Step 3: Create (or overwrite) an Expectation Suite in the context
suite_name = "product_sales_suite"
suite = context.add_or_update_expectation_suite(expectation_suite_name=suite_name)

# Step 4: Load data as a Great Expectations Dataset
# Using pandas dataset inside GE for simplicity
import pandas as pd
df = pd.read_json(json_file)
ge_df = ge.from_pandas(df)

# Step 5: Add expectations to check for non-null ProductID and Price
ge_df.expect_column_values_to_not_be_null("ProductID")
ge_df.expect_column_values_to_not_be_null("Price")

# Step 6: Save the expectation suite to the data context.
# Failing expectations are kept so the null checks are not silently dropped.
suite = ge_df.get_expectation_suite(discard_failed_expectations=False)
context.save_expectation_suite(suite, expectation_suite_name=suite_name)

# Step 7: Validate the dataset against the expectations attached to it
results = ge_df.validate()

# Step 8: Print summary of validation results
print("\nValidation Results Summary:")
# Single quotes inside the f-string avoid the "unmatched '['" SyntaxError
print(f"Success: {results['success']}")
for result in results["results"]:
    expectation = result["expectation_config"]
    outcome = result["success"]
    column = expectation["kwargs"]["column"]
    print(f"Expectation '{expectation['expectation_type']}' on column '{column}' passed? {outcome}")

2. Writing Validation Rules for Data Ingestion

### Define validation rules for an API data source to confirm that the 'Status' field contains only predefined statuses ('Active', 'Inactive').

- Apply `expect_column_values_to_be_in_set` to check field values during data ingestion.
- Execute the validation and review any mismatches.

In [None]:
# write your code from here
# This cell assumes a pre-1.0 Great Expectations release (RuntimeBatchRequest API).
import great_expectations as ge
from great_expectations.core.batch import RuntimeBatchRequest

# Initialize the Great Expectations Data Context (loads or creates a GE project)
context = ge.get_context()

# Define datasource name
datasource_name = "json_datasource"

# Register the datasource if it does not exist yet.
# list_datasources() returns config dicts, so compare against their names.
existing_names = [ds["name"] for ds in context.list_datasources()]
if datasource_name not in existing_names:
    context.add_datasource(
        name=datasource_name,
        class_name="Datasource",
        execution_engine={
            "class_name": "PandasExecutionEngine"
        },
        data_connectors={
            "default_runtime_data_connector_name": {
                "class_name": "RuntimeDataConnector",
                "batch_identifiers": ["default_identifier_name"]
            }
        }
    )

# Create a runtime batch request to load the JSON file.
# runtime_parameters is only accepted by RuntimeBatchRequest, not BatchRequest.
batch_request = RuntimeBatchRequest(
    datasource_name=datasource_name,
    data_connector_name="default_runtime_data_connector_name",
    data_asset_name="sales_data",  # arbitrary name
    runtime_parameters={"path": "./sales_data.json"},
    batch_identifiers={"default_identifier_name": "default_identifier"}
)

# Create (or overwrite) the expectation suite
expectation_suite_name = "sales_data_null_check_suite"
context.add_or_update_expectation_suite(expectation_suite_name=expectation_suite_name)

# Get validator for the batch
validator = context.get_validator(
    batch_request=batch_request,
    expectation_suite_name=expectation_suite_name
)

# Add expectations for non-null values in 'ProductID' and 'Price'
validator.expect_column_values_to_not_be_null("ProductID")
validator.expect_column_values_to_not_be_null("Price")

# Save the expectation suite, keeping expectations that failed on this batch
validator.save_expectation_suite(discard_failed_expectations=False)

# Run validation
results = validator.validate()

# Print validation summary
print(f"Validation Success: {results.success}")
print("Unexpected counts for nulls:")
for res in results.results:
    if "unexpected_count" in res["result"]:
        column = res["expectation_config"]["kwargs"]["column"]
        print(f"{column}: {res['result']['unexpected_count']} null values found")