### Task 1: Automated Data Profiling

**Steps**:
1. Using pandas-profiling (since renamed to `ydata-profiling`)
    - Generate a profile report for an existing CSV file.
    - Customize the profile report to include correlations.
    - Profile a specific subset of columns.
2. Using Great Expectations
    - Create a basic expectation suite for your data.
    - Validate data against an expectation suite.
    - Add multiple expectations to a suite.
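
As a rough illustration of what a profile report summarizes, the sketch below computes a per-column overview and a Pearson correlation matrix with plain pandas. It is not the profiling library's API: `quick_profile` and the sample data are hypothetical names introduced only for this example.

```python
import pandas as pd


def quick_profile(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column summary: dtype, share of missing values, distinct count."""
    return pd.DataFrame(
        {
            "dtype": df.dtypes.astype(str),
            "missing_pct": df.isna().mean() * 100,
            "n_unique": df.nunique(),  # NaN is not counted as a value
        }
    )


# Hypothetical sample data, standing in for the CSV file
sample = pd.DataFrame(
    {
        "age": [25, 30, None, 40],
        "salary": [50000, 60000, 75000, 80000],
        "city": ["Pune", "Delhi", "Pune", None],
    }
)

summary = quick_profile(sample)
# Pearson correlations over the numeric columns only
pearson = sample.select_dtypes(include="number").corr(method="pearson")

print(summary)
print(pearson)
```

A full profile report adds histograms, warnings, and interaction plots on top of summaries like these.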

In [1]:
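# pandas-profiling was renamed to ydata-profiling; the old package fails to import under pydantic 2.
# great_expectations is pinned below 1.0 on the assumption that `ge.from_pandas`
# (used further down) is a legacy API that only exists in pre-1.0 releases.
%pip install pandas ydata-profiling "great_expectations<1.0"
import pandas as pd
from ydata_profiling import ProfileReport
import great_expectations as ge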

# Load CSV file
csv_path = "your_data.csv"  # Replace with your CSV file path
df = pd.read_csv(csv_path)

# --- 1. Pandas Profiling ---

# Generate full profile with correlations
profile = ProfileReport(df, title="Full Data Profile", correlations={"pearson": {"calculate": True}})
profile.to_file("full_profile_report.html")
print("Full profile report saved as 'full_profile_report.html'.")

# Generate profile for a subset of columns
subset_columns = df.columns[:5].tolist()  # example: first 5 columns
profile_subset = ProfileReport(df[subset_columns], title="Subset Profile Report")
profile_subset.to_file("subset_profile_report.html")
print("Subset profile report saved as 'subset_profile_report.html'.")

# --- 2. Great Expectations ---
# NOTE: this section uses the legacy pandas-dataset API (`ge.from_pandas`),
# which is assumed to be available only in Great Expectations releases before 1.0.

# Wrap the pandas df; every expectation called on it is recorded in its suite
ge_df = ge.from_pandas(df)
suite_name = "basic_expectations"

# Add multiple expectations
ge_df.expect_table_columns_to_match_ordered_list(column_list=df.columns.tolist())
for col in df.columns:
    ge_df.expect_column_values_to_not_be_null(column=col)
    if pd.api.types.is_numeric_dtype(df[col]):
        # Accept a mean within +/-50% of the current mean; sorted() keeps the
        # bounds in order when the mean is negative
        low, high = sorted([df[col].mean() * 0.5, df[col].mean() * 1.5])
        ge_df.expect_column_mean_to_be_between(column=col, min_value=low, max_value=high)

# Collect the recorded expectations into a suite, keeping those that failed on this data
suite = ge_df.get_expectation_suite(discard_failed_expectations=False)
suite.expectation_suite_name = suite_name
print(f"Created expectation suite '{suite_name}' with {len(suite.expectations)} expectations.")

# Validate data against the suite
results = ge_df.validate(expectation_suite=suite)
print("Validation Results Summary:")
print(f"Success: {results.success}")
for res in results.results:
    if not res.success:
        exp_type = res.expectation_config.expectation_type
        col = res.expectation_config.kwargs.get("column", "N/A")
        print(f" - Failed expectation: {exp_type} on column: {col}")

With `pandas-profiling` 3.6.6 and pydantic 2.x installed, the cell stops with a `PydanticImportError`, most likely raised by `from pandas_profiling import ProfileReport`, since that release imports `BaseSettings` from pydantic:

PydanticImportError: `BaseSettings` has been moved to the `pydantic-settings` package. See https://docs.pydantic.dev/2.11/migration/#basesettings-has-moved-to-pydantic-settings for more details.

For further information visit https://errors.pydantic.dev/2.11/u/import-error

`pandas-profiling` has been renamed to `ydata-profiling`, which is imported with `from ydata_profiling import ProfileReport`.
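
To make the expectation workflow concrete without depending on a particular Great Expectations version, here is a minimal plain-pandas sketch of the same idea: define several checks, run them against a DataFrame, and report failures. The helper names (`check_not_null`, `check_mean_between`, `run_checks`) and the sample data are hypothetical and are not part of the Great Expectations API.

```python
import pandas as pd


def check_not_null(df, column):
    """Succeeds when the column contains no missing values."""
    ok = bool(df[column].notna().all())
    return {"expectation_type": "not_null", "column": column, "success": ok}


def check_mean_between(df, column, min_value, max_value):
    """Succeeds when the column mean lies within [min_value, max_value]."""
    mean = float(df[column].mean())
    ok = min_value <= mean <= max_value
    return {"expectation_type": "mean_between", "column": column,
            "success": ok, "observed": mean}


def run_checks(df, checks):
    """Run every check and summarize, loosely mirroring a validation result."""
    results = [check(df) for check in checks]
    return {"success": all(r["success"] for r in results), "results": results}


# Hypothetical sample data with one missing Age value
sample = pd.DataFrame({"Age": [25, 30, None], "Salary": [50000, 60000, 75000]})

checks = [
    lambda d: check_not_null(d, "Age"),
    lambda d: check_not_null(d, "Salary"),
    lambda d: check_mean_between(d, "Age", 20, 40),
    lambda d: check_mean_between(d, "Salary", 50000, 70000),
]

report = run_checks(sample, checks)
failed = [(r["expectation_type"], r["column"])
          for r in report["results"] if not r["success"]]
print(f"Success: {report['success']}")
print(f"Failed checks: {failed}")
```

Only the null check on `Age` fails here; the mean checks ignore the missing value because pandas skips NaN when averaging.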

### Task 2: Real-time Monitoring of Data Quality

**Steps**:
1. Setting up Alerts for Quality Drops
    - Use the logging library to set up a basic alert on failed expectations.
    - Implement alerts using email notifications.
    - Use a dashboard such as Grafana for visual alerts.
        - Note: this assumes integration with a monitoring system.
        - Alert setup involves creating a data source and an alert rule in Grafana.

In [2]:
import logging

# Configure logger
logger = logging.getLogger("DataQualityMonitor")
logger.setLevel(logging.INFO)  # or logging.WARNING to emit only alerts

# Console handler (added once, so re-running the cell does not duplicate output)
if not logger.handlers:
    console_handler = logging.StreamHandler()
    console_handler.setLevel(logging.INFO)
    formatter = logging.Formatter("%(asctime)s - %(levelname)s - %(message)s")
    console_handler.setFormatter(formatter)
    logger.addHandler(console_handler)

def log_quality_drop(message):
    """Log a data quality alert at WARNING level."""
    logger.warning("DATA QUALITY ALERT: %s", message)

# Usage example
log_quality_drop("null values found in column 'example_column'")
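
The first two alerting steps can be sketched with the standard library alone. The example below assumes validation results shaped like the dictionaries read in Task 1 (`results["results"]`, each with `success` and `expectation_config`). The function names, addresses, and the injected `send` callable are hypothetical, and `send_alert_email` assumes an SMTP server you control.

```python
import logging
import smtplib
from email.message import EmailMessage

alert_logger = logging.getLogger("DataQualityAlerts")


def failed_checks(results):
    """Return the individual results that did not succeed."""
    return [r for r in results["results"] if not r["success"]]


def build_alert_email(failures, sender, recipient):
    """Compose (but do not send) an alert email listing failed expectations."""
    msg = EmailMessage()
    msg["Subject"] = f"Data quality alert: {len(failures)} failed expectation(s)"
    msg["From"] = sender
    msg["To"] = recipient
    lines = []
    for f in failures:
        config = f["expectation_config"]
        column = config["kwargs"].get("column", "N/A")
        lines.append(f"- {config['expectation_type']} on column: {column}")
    msg.set_content("\n".join(lines))
    return msg


def send_alert_email(msg, host, port=25):
    """Send the message through an SMTP server (host and port are assumptions)."""
    with smtplib.SMTP(host, port) as server:
        server.send_message(msg)


def alert_on_failures(results, sender, recipient, send=None):
    """Log each failure and, if a send callable is given, email a summary."""
    failures = failed_checks(results)
    for f in failures:
        alert_logger.warning("DATA QUALITY ALERT: %s failed",
                             f["expectation_config"]["expectation_type"])
    if failures and send is not None:
        send(build_alert_email(failures, sender, recipient))
    return failures


# Hypothetical validation results; emails are captured in a list instead of sent
example_results = {
    "success": False,
    "results": [
        {"success": True,
         "expectation_config": {"expectation_type": "expect_column_mean_to_be_between",
                                "kwargs": {"column": "Salary"}}},
        {"success": False,
         "expectation_config": {"expectation_type": "expect_column_values_to_not_be_null",
                                "kwargs": {"column": "Age"}}},
    ],
}
sent = []
failures = alert_on_failures(example_results, "monitor@example.com",
                             "team@example.com", send=sent.append)
print(sent[0]["Subject"])
print(sent[0].get_content())
```

Passing the sender as a callable keeps the example runnable offline; in a real pipeline `send` could be `lambda m: send_alert_email(m, "smtp.example.com")`.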

### Task 3: Using AI for Data Quality Monitoring

**Steps**:
1. Basic AI Models for Monitoring
    - Train a simple anomaly detection model using Isolation Forest.
    - Use a simple custom function (z-score based) for outlier detection.
    - Create a monitoring function that uses a pre-trained machine learning model.

In [3]:
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Sample data (some normal and some potential anomalies)
data = np.array([
    [25, 50000],
    [30, 60000],
    [35, 75000],
    [40, 80000],
    [45, 100000],
    [150, 200000],  # likely anomaly (age too high)
    [29, 58000],
    [33, 70000],
    [28, 62000],
    [22, 48000],
])

df = pd.DataFrame(data, columns=["Age", "Salary"])

# Step 1: Train Isolation Forest model for anomaly detection
def train_anomaly_detector(dataframe):
    model = IsolationForest(contamination=0.1, random_state=42)
    model.fit(dataframe)
    return model

# Step 2: Define a custom AI logic function for outlier detection (example: simple z-score)
def simple_outlier_detector(df, column, threshold=3):
    mean = df[column].mean()
    std = df[column].std()
    df[f"{column}_zscore"] = (df[column] - mean) / std
    df[f"{column}_outlier"] = df[f"{column}_zscore"].abs() > threshold
    return df

# Step 3: Monitoring function that uses the pre-trained model and custom logic
def monitor_data_quality(df, model):
    # Work on a copy and fix the feature list up front, so helper columns added
    # below are never fed back into the model or the z-score loop.
    # Assumes the model was trained on exactly these numeric columns.
    df = df.copy()
    feature_cols = df.select_dtypes(include=[np.number]).columns.tolist()

    # Predict anomalies with Isolation Forest (-1 = anomaly)
    df["anomaly_flag"] = model.predict(df[feature_cols])

    # Apply custom outlier detection on the original numeric columns only
    for col in feature_cols:
        df = simple_outlier_detector(df, col)

    # Collect anomaly records from Isolation Forest and custom logic
    anomalies_isolation_forest = df[df["anomaly_flag"] == -1]
    outlier_cols = [f"{col}_outlier" for col in feature_cols]
    anomalies_custom = df[df[outlier_cols].any(axis=1)]

    print("\nAnomalies detected by Isolation Forest:")
    print(anomalies_isolation_forest)

    # Note: with the sample standard deviation, |z| cannot exceed (n - 1) / sqrt(n),
    # about 2.85 for the 10 rows here, so a threshold of 3 flags nothing on this sample
    print("\nAnomalies detected by custom z-score logic:")
    print(anomalies_custom)

    # Alert if anomalies found
    if not anomalies_isolation_forest.empty or not anomalies_custom.empty:
        print("\n🚨 ALERT: Data quality anomalies detected!")

# Usage example
if __name__ == "__main__":
    model = train_anomaly_detector(df[["Age", "Salary"]])
    monitor_data_quality(df, model)
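
On a sample this small, a plain z-score threshold of 3 cannot fire, because a single extreme value also inflates the mean and standard deviation it is measured against. The sketch below swaps in a different technique, the modified z-score based on the median and the median absolute deviation (MAD), to show the contrast on the same ten rows. The function names are hypothetical; the 0.6745 scaling constant and the 3.5 cutoff are the values commonly recommended for this statistic, not something taken from the notebook.

```python
import pandas as pd


def zscore(series: pd.Series) -> pd.Series:
    """Classic z-score using the sample standard deviation."""
    return (series - series.mean()) / series.std()


def modified_zscore(series: pd.Series) -> pd.Series:
    """Median/MAD-based score; undefined (division by zero) when the MAD is 0."""
    median = series.median()
    mad = (series - median).abs().median()
    return 0.6745 * (series - median) / mad


# Same rows as the Age/Salary sample in the cell above
sample = pd.DataFrame(
    {
        "Age": [25, 30, 35, 40, 45, 150, 29, 33, 28, 22],
        "Salary": [50000, 60000, 75000, 80000, 100000, 200000,
                   58000, 70000, 62000, 48000],
    }
)

# Flag a row when any column exceeds the threshold in absolute value
plain_flags = (sample.apply(zscore).abs() > 3).any(axis=1)
robust_flags = (sample.apply(modified_zscore).abs() > 3.5).any(axis=1)

print("Rows flagged by plain z-score:", plain_flags[plain_flags].index.tolist())
print("Rows flagged by modified z-score:", robust_flags[robust_flags].index.tolist())
```

The median and MAD are barely moved by the one extreme row, so that row stands out clearly, whereas the plain z-score stays below 3 for every value.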