### Using Great Expectations for Automated Data Checks
**Objective**: Use Great Expectations to perform data validation steps on a dataset.

**Task 1**: Validate Column Existence

**Steps**:
- Load your dataset using a Pandas DataFrame.
- Use Great Expectations to set up an expectation suite.
- Create an expectation to confirm that a specific column (e.g., customer_id) exists in your dataset.
- Run the expectation and observe the results.
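
As a plain-Python cross-check of what a column-existence expectation verifies, the sketch below reports which required columns are missing from a table header. It does not use Great Expectations; the helper name `find_missing_columns` and the `email` column are hypothetical, chosen only for illustration.

```python
def find_missing_columns(columns, required):
    """Return the required column names that are absent, in the order given."""
    present = set(columns)
    return [name for name in required if name not in present]


# Header of the sample dataset used in this exercise.
columns = ["customer_id", "name", "age"]

# 'email' is a hypothetical column that the sample data does not contain.
missing = find_missing_columns(columns, ["customer_id", "email"])
print(missing)
```

An empty list means every required column is present, which corresponds to the expectation succeeding.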

In [1]:
# Install if needed (this cell assumes Great Expectations 1.x):
# !pip install "great_expectations>=1.0" pandas

import pandas as pd
import great_expectations as gx

# Step 1: Create a sample DataFrame
data = {
    "customer_id": [101, 102, 103],
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 30, 35],
}
df = pd.DataFrame(data)

# Step 2: Create an in-memory (ephemeral) Data Context; nothing is written to disk
context = gx.get_context(mode="ephemeral")

# Step 3: Register a pandas data source and a DataFrame asset
data_source = context.data_sources.add_pandas(name="my_pandas_datasource")
data_asset = data_source.add_dataframe_asset(name="my_dataframe")  # a name you assign

# Step 4: Define a batch that covers the whole DataFrame and load it
batch_definition = data_asset.add_batch_definition_whole_dataframe("whole_dataframe")
batch = batch_definition.get_batch(batch_parameters={"dataframe": df})

# Step 5: Create the expectation suite
suite = context.suites.add(gx.ExpectationSuite(name="customer_data_suite"))

# Step 6: Add an expectation that the 'customer_id' column exists
suite.add_expectation(gx.expectations.ExpectColumnToExist(column="customer_id"))

# Step 7: Validate the batch against the suite
result = batch.validate(suite)

# Step 8: Print result
print("Validation result for suite 'customer_data_suite':")
print(result)

if result.success:
    print("\n✅ The column 'customer_id' exists in the dataset.")
else:
    print("\n❌ The column 'customer_id' does NOT exist in the dataset.")

**Task 2**: Validate Column Data Types

**Steps**:
- Using the same dataset setup, create an expectation to check that a numeric column (e.g., purchase_amount) contains only float values.
- Identify a numeric column in your dataset.
- Use Great Expectations to create and validate an expectation that checks the column's data type is correct.
- Run your expectation and check if it passes for your data.
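
To make the idea of a type-list check concrete, here is a minimal plain-Python sketch that returns the values whose type name is not in an allowed list. It is a simplification and not Great Expectations code: with a pandas backend the check typically works with column dtypes such as `float64`, whereas this sketch inspects each Python value. The helper name `values_outside_type_list` is hypothetical.

```python
def values_outside_type_list(values, type_list):
    """Return the values whose Python type name is not in type_list."""
    allowed = set(type_list)
    return [v for v in values if type(v).__name__ not in allowed]


purchase_amount = [100.0, 200.5, 150.75]
age = [25, 30, 35]

# Every purchase amount is a float, so nothing is reported.
print(values_outside_type_list(purchase_amount, ["float"]))

# Integer ages fail a float-only check; allowing "int" as well would pass them.
print(values_outside_type_list(age, ["float"]))
```

This is why a type list that names both integer and float types is the usual choice when a column only needs to be numeric.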

In [2]:
# Install if not already (this cell assumes Great Expectations 1.x,
# which does not provide ge.from_pandas):
# !pip install "great_expectations>=1.0" pandas

import pandas as pd
import great_expectations as gx

# Step 1: Sample DataFrame
data = {
    "customer_id": [101, 102, 103],
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25.0, 30.5, 35.0],            # Numeric column (floats)
    "purchase_amount": [100.0, 200.5, 150.75],
}

df = pd.DataFrame(data)

# Step 2: Load the DataFrame as a batch through an ephemeral (in-memory) context
context = gx.get_context(mode="ephemeral")
batch = (
    context.data_sources.add_pandas(name="pandas_task2")
    .add_dataframe_asset(name="customers")
    .add_batch_definition_whole_dataframe("whole_dataframe")
    .get_batch(batch_parameters={"dataframe": df})
)

# Step 3: Expect 'customer_id' column exists
result_col_exist = batch.validate(
    gx.expectations.ExpectColumnToExist(column="customer_id")
)
print("Column 'customer_id' existence:", result_col_exist)

# pandas dtype names accepted as numeric in the checks below
numeric_types = ["float64", "int64"]

# Step 4: Validate 'age' column contains only numeric values (float or int)
result_age_type = batch.validate(
    gx.expectations.ExpectColumnValuesToBeInTypeList(
        column="age", type_list=numeric_types
    )
)
print("\nColumn 'age' data type check:", result_age_type)

# Step 5: Validate 'purchase_amount' column numeric type
result_purchase_type = batch.validate(
    gx.expectations.ExpectColumnValuesToBeInTypeList(
        column="purchase_amount", type_list=numeric_types
    )
)
print("\nColumn 'purchase_amount' data type check:", result_purchase_type)

# Interpretation
if result_age_type.success:
    print("\n✅ 'age' column contains only numeric values.")
else:
    print("\n❌ 'age' column contains non-numeric values.")

if result_purchase_type.success:
    print("\n✅ 'purchase_amount' column contains only numeric values.")
else:
    print("\n❌ 'purchase_amount' column contains non-numeric values.")

**Task 3**: Validate Range of Values

**Steps**:
- Set an expectation using Great Expectations to ensure that the values in a column (e.g., age) are between 18 and 65.
- Identify a column in your dataset where values fall within a specific range.
- Implement a range-based expectation to check this column and validate your dataset.
- Observe and interpret the result of your expectation.
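
When interpreting a range expectation, it helps to see how a `mostly` threshold changes the outcome. The following is a minimal plain-Python sketch of that logic, assuming inclusive bounds and that missing values are skipped; it is not the library's implementation, and the helper name `check_between` is hypothetical.

```python
def check_between(values, min_value, max_value, mostly=1.0):
    """Return (success, unexpected) for an inclusive range check.

    None values are skipped. The check passes when the fraction of the
    remaining values inside [min_value, max_value] is at least `mostly`.
    """
    non_null = [v for v in values if v is not None]
    unexpected = [v for v in non_null if not (min_value <= v <= max_value)]
    if not non_null:
        # Nothing to check, so there is nothing that can fail.
        return True, unexpected
    fraction_ok = (len(non_null) - len(unexpected)) / len(non_null)
    return fraction_ok >= mostly, unexpected


ages = [25, 30, 70]

# With mostly=1.0 a single out-of-range value fails the check.
print(check_between(ages, 18, 65))

# With mostly=0.6, two of three values in range (about 0.67) is enough to pass.
print(check_between(ages, 18, 65, mostly=0.6))
```

Lowering `mostly` tolerates a share of outliers, so the list of unexpected values is still worth inspecting even when the check passes.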

In [3]:
# This cell assumes Great Expectations 1.x, which does not provide ge.from_pandas
import pandas as pd
import great_expectations as gx

# Sample data
data = {
    "customer_id": [101, 102, 103],
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 30, 70],  # Notice 70 is outside the expected range 18-65
    "purchase_amount": [100.0, 200.5, 150.75],
}

df = pd.DataFrame(data)

# Load the DataFrame as a batch through an ephemeral (in-memory) context
context = gx.get_context(mode="ephemeral")
batch = (
    context.data_sources.add_pandas(name="pandas_task3")
    .add_dataframe_asset(name="customers")
    .add_batch_definition_whole_dataframe("whole_dataframe")
    .get_batch(batch_parameters={"dataframe": df})
)

# Task 1 & 2 Recap: Check column existence and type (optional here)
print(batch.validate(gx.expectations.ExpectColumnToExist(column="customer_id")))
print(
    batch.validate(
        gx.expectations.ExpectColumnValuesToBeInTypeList(
            column="age", type_list=["int64", "float64"]
        )
    )
)

# Task 3: Validate 'age' values are between 18 and 65 (bounds are inclusive)
result_age_range = batch.validate(
    gx.expectations.ExpectColumnValuesToBeBetween(
        column="age",
        min_value=18,
        max_value=65,
        mostly=1.0,  # 100% of values must be within range to pass
    )
)

print("\nAge range validation result:")
print(result_age_range)

# Interpret result
if result_age_range.success:
    print("\n✅ All 'age' values are within the range 18 to 65.")
else:
    print("\n❌ Some 'age' values are outside the range 18 to 65.")
    # Show rows violating the expectation
    invalid_ages = df[(df["age"] < 18) | (df["age"] > 65)]
    print("\nInvalid age entries:")
    print(invalid_ages)