## Automated Data Quality Monitoring
**Objective**: Use Great Expectations to perform data profiling and write validation rules.

1. Data Profiling with Great Expectations
### Profile a CSV dataset containing customer information to inspect the distribution patterns of the 'Age' and 'Income' columns.
- Load the dataset using Great Expectations and create a data context.
- Generate a data asset to inspect the summary statistics.
- View the generated expectation suite to analyze data distributions.
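
As a rough illustration of what the profiling steps above are meant to surface, the sketch below computes summary statistics for 'Age' and 'Income' with plain pandas and turns the observed ranges into candidate rule arguments. It is not the Great Expectations profiler: the `customers` data is made up for the example, and `candidate_rules` is a hypothetical helper structure whose keys mirror the `column`, `min_value` and `max_value` arguments of `expect_column_values_to_be_between`.

```python
import pandas as pd

# Hypothetical customer records (assumption); the real exercise would load
# these from a CSV file, e.g. with pd.read_csv(...).
customers = pd.DataFrame({
    "Age": [23, 35, 41, 29, 52, 38],
    "Income": [32000, 54000, 61000, 45000, 88000, 57000],
})

# Summary statistics (count, mean, std, min, quartiles, max) per column.
profile = customers[["Age", "Income"]].describe()
print(profile)

# Turn each observed range into candidate keyword arguments for
# expect_column_values_to_be_between. Observed bounds are only a starting
# point and usually need review before they become rules.
candidate_rules = {
    column: {
        "column": column,
        "min_value": float(customers[column].min()),
        "max_value": float(customers[column].max()),
    }
    for column in ["Age", "Income"]
}
print(candidate_rules)
```

Each entry in `candidate_rules` could then be passed to a validator, for example `validator.expect_column_values_to_be_between(**candidate_rules["Age"])`, assuming a validator has been created for the same data.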

In [4]:
import pandas as pd
import great_expectations as gx

# Sample DataFrame with Date column
df = pd.DataFrame({
    "Date": ["2023-01-01", "2023-02-30", "2023/03/15", "2023-04-10", "15-05-2023"],
    "Value": [100, 200, 150, 300, 250]
})

# Initialize in-memory (Ephemeral) Great Expectations context
context = gx.get_context()

# Create a batch request for pandas dataframe runtime batch
batch_request = {
    "datasource_name": "pandas_datasource",
    "data_connector_name": "default_runtime_data_connector_name",
    "data_asset_name": "my_data_asset",  # arbitrary name for this batch
    "runtime_parameters": {"batch_data": df},
    "batch_identifiers": {"default_identifier_name": "default_identifier"},
}

# Add a Pandas datasource (block-config API with a RuntimeDataConnector)
try:
    context.add_datasource(
        name="pandas_datasource",
        class_name="Datasource",
        execution_engine={"class_name": "PandasExecutionEngine"},
        data_connectors={
            "default_runtime_data_connector_name": {
                "class_name": "RuntimeDataConnector",
                "batch_identifiers": ["default_identifier_name"],
            }
        },
    )
except Exception:
    # If datasource already exists, ignore error
    pass

# Create the expectation suite (overwrites any existing suite with this name)
suite_name = "date_validation_suite"
try:
    context.create_expectation_suite(suite_name, overwrite_existing=True)
except Exception:
    pass

# Get a Validator for this batch and suite.
# get_validator() rejects a plain dict, so wrap it in a RuntimeBatchRequest.
from great_expectations.core.batch import RuntimeBatchRequest

validator = context.get_validator(
    batch_request=RuntimeBatchRequest(**batch_request),
    expectation_suite_name=suite_name,
)

# Add expectation: Date column matches regex YYYY-MM-DD
validator.expect_column_values_to_match_regex(
    column="Date",
    regex=r"^\d{4}-\d{2}-\d{2}$"
)

# Validate and get results
validation_results = validator.validate()

# Print summary of validation results
print(validation_results)

# Optionally, show detailed result for the 'Date' column expectation
for result in validation_results['results']:
    if result['expectation_config']['expectation_type'] == "expect_column_values_to_match_regex":
        print(result)

2. Writing Validation Rules for Data Ingestion
### Write validation rules for a CSV file to ensure the 'Date' column follows a specific date format.
- Utilize expect_column_values_to_match_regex to enforce date format validation.
- Run the validation and interpret the output.
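
One caveat when interpreting the output: a format regex checks the shape of a value, not whether it is a real calendar date. The sketch below uses only the standard library, with the same sample dates as this notebook, to compare the regex check against a `datetime.strptime` parse. The helper `is_real_date` is a hypothetical name introduced for the example.

```python
import re
from datetime import datetime

# Same pattern as the expectation: four digits, two digits, two digits.
DATE_SHAPE = re.compile(r"^\d{4}-\d{2}-\d{2}$")

dates = ["2023-01-01", "2023-02-30", "2023/03/15", "2023-04-10", "15-05-2023"]


def is_real_date(value: str) -> bool:
    """Return True if value parses as a calendar date in YYYY-MM-DD form."""
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return True
    except ValueError:
        return False


# Values the regex rejects (wrong separators or field order).
shape_failures = [d for d in dates if not DATE_SHAPE.match(d)]

# Values that pass the regex but are not real dates.
regex_only_passes = [d for d in dates if DATE_SHAPE.match(d) and not is_real_date(d)]

print("Rejected by regex:", shape_failures)
print("Accepted by regex but invalid:", regex_only_passes)
```

Here "2023-02-30" matches the pattern even though February has no 30th day, so it would not be counted as unexpected by the regex expectation. If calendar validity matters, Great Expectations also provides `expect_column_values_to_match_strftime_format`, which may be a closer fit than a regex.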

In [None]:
# Write your code here
import pandas as pd
import great_expectations as gx

# Sample dataset with Date column
df = pd.DataFrame({
    "Date": ["2023-01-01", "2023-02-30", "2023/03/15", "2023-04-10", "15-05-2023"],
    "Value": [100, 200, 150, 300, 250]
})

# Create Great Expectations context
context = gx.get_context()

# Add Pandas datasource and DataFrame asset
datasource = context.sources.add_pandas(name="pandas_src")
data_asset = datasource.add_dataframe_asset(name="date_data", dataframe=df)

# Create the expectation suite (overwrites any existing suite with this name)
expectation_suite = context.create_expectation_suite("date_validation_suite", overwrite_existing=True)

# Get validator
validator = context.get_validator(
    batch_request=data_asset.build_batch_request(),
    expectation_suite=expectation_suite,
)

# Add expectation for Date column format YYYY-MM-DD
validator.expect_column_values_to_match_regex(
    column="Date",
    regex=r"^\d{4}-\d{2}-\d{2}$"
)

# Validate data
results = validator.validate()

# Print validation results
print(results)


{
  "success": false,
  "results": [
    {
      "success": false,
      "expectation_config": {
        "expectation_type": "expect_column_values_to_match_regex",
        "kwargs": {
          "column": "Date",
          "regex": "^\\d{4}-\\d{2}-\\d{2}$",
          "batch_id": "pandas_src-date_data"
        },
        "meta": {}
      },
      "result": {
        "element_count": 5,
        "unexpected_count": 2,
        "unexpected_percent": 40.0,
        "partial_unexpected_list": [
          "2023/03/15",
          "15-05-2023"
        ],
        "missing_count": 0,
        "missing_percent": 0.0,
        "unexpected_percent_total": 40.0,
        "unexpected_percent_nonmissing": 40.0
      },
      "meta": {},
      "exception_info": {
        "raised_exception": false,
        "exception_traceback": null,
        "exception_message": null
      }
    }
  ],
  "evaluation_parameters": {},
  "statistics": {
    "evaluated_expectations": 1,
    "successful_expectations": 0,
    "unsu