### Task 1: Automated Data Profiling

**Steps**:
1. Using pandas-profiling (now distributed as `ydata-profiling`)
    - Generate a profile report for an existing CSV file.
    - Customize the profile report to include correlations.
    - Profile a specific subset of columns.
2. Using Great Expectations
    - Create a basic expectation suite for your data.
    - Validate data against an expectation suite.
    - Add multiple expectations to a suite.

In [16]:

import pandas as pd
import numpy as np
from ydata_profiling import ProfileReport
import great_expectations as gx

# 1. Generate sample dataset
np.random.seed(42)
data = {
    'age': np.random.randint(18, 70, 100),
    'income': np.random.normal(50000, 15000, 100).round(2),
    'email': [f'user{i}@example.com' for i in range(100)],
    'gender': np.random.choice(['Male', 'Female'], 100),
    'purchase_amount': np.random.uniform(10, 500, 100).round(2),
    'signup_date': pd.date_range(start='2023-01-01', periods=100, freq='D')
}
df = pd.DataFrame(data)

# 2. Generate pandas-profiling report with correlations
profile = ProfileReport(
    df,
    title="Data Profiling Report",
    correlations={
        "pearson": {"calculate": True},
        "spearman": {"calculate": False},  # disable spearman to avoid known issues
        "kendall": {"calculate": False},
        "phi_k": {"calculate": False},
        "cramers": {"calculate": False}
    }
)

# Display profile report inline (Jupyter) or save as fallback
try:
    profile.to_notebook_iframe()
except Exception as e:
    print(f"Could not display profile inline due to: {e}")
    profile.to_file("data_profile.html")
    print("Profile saved to data_profile.html")

# 3. Initialize Great Expectations context
context = gx.get_context()

# 4. Generate Expectation Suite automatically from profile report
expectation_suite_name = "auto_generated_suite"
suite = profile.to_expectation_suite(suite_name=expectation_suite_name)

# 5. Save the suite to Great Expectations context
context.save_expectation_suite(suite, expectation_suite_name)

# 6. Create Validator and validate the DataFrame against the generated suite
validator = context.get_validator(
    datasource_name=None,
    data_asset_name="sample_data_asset",
    batch_data=df,
    expectation_suite_name=expectation_suite_name,
)

validation_result = validator.validate()

# 7. Print validation summary
print("\nGreat Expectations Validation Results:")
print(f"Success: {validation_result['success']}")
for result in validation_result["results"]:
    exp_type = result["expectation_config"]["expectation_type"]
    column = result["expectation_config"]["kwargs"].get("column", "")
    success = result["success"]
    print(f" - {exp_type} on column '{column}': {'Passed' if success else 'Failed'}")




Could not display profile inline due to: 'float' object has no attribute 'ndim'



Profile saved to data_profile.html





AttributeError: module 'great_expectations.data_context' has no attribute 'DataContext'
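The `AttributeError` above indicates that the installed Great Expectations release does not expose `great_expectations.data_context.DataContext`, which `to_expectation_suite` appears to rely on. This points to a version mismatch between the two libraries, so it is worth checking what is installed before debugging further. The sketch below uses only the standard library; the helper name `installed_version` is hypothetical, and the distribution names are assumed to match the packages' PyPI names.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


# Distribution names are assumptions based on the PyPI package names.
for name in ("great-expectations", "ydata-profiling", "pandas", "numpy"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

Comparing the printed versions with each library's release notes should show whether the calls used in the cell exist in the installed release.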

In [17]:
import pandas as pd
import numpy as np
from ydata_profiling import ProfileReport
import great_expectations as gx

# 1. Generate sample dataset
np.random.seed(42)
data = {
    'age': np.random.randint(18, 70, 100),
    'income': np.random.normal(50000, 15000, 100).round(2),
    'email': [f'user{i}@example.com' for i in range(100)],
    'gender': np.random.choice(['Male', 'Female'], 100),
    'purchase_amount': np.random.uniform(10, 500, 100).round(2),
    'signup_date': pd.date_range(start='2023-01-01', periods=100, freq='D')
}
df = pd.DataFrame(data)

# 2. Generate pandas-profiling report with correlations
profile = ProfileReport(
    df,
    title="Data Profiling Report",
    correlations={
        "pearson": {"calculate": True},
        "spearman": {"calculate": False},
        "kendall": {"calculate": False},
        "phi_k": {"calculate": False},
        "cramers": {"calculate": False}
    }
)

try:
    profile.to_notebook_iframe()
except Exception as e:
    print(f"Could not display profile inline due to: {e}")
    profile.to_file("data_profile.html")
    print("Profile saved to data_profile.html")

# 3. Initialize Great Expectations context
context = gx.get_context()

# 4. Create or load expectation suite manually
expectation_suite_name = "manual_suite"
try:
    suite = context.get_expectation_suite(expectation_suite_name)
except gx.exceptions.DataContextError:
    suite = context.create_expectation_suite(expectation_suite_name, overwrite_existing=True)

# Clear existing expectations
suite.expectations = []

# 5. Add expectations manually based on profiling insights
suite.add_expectation({
    "expectation_type": "expect_column_values_to_not_be_null",
    "kwargs": {"column": "age"}
})
suite.add_expectation({
    "expectation_type": "expect_column_values_to_be_unique",
    "kwargs": {"column": "email"}
})
suite.add_expectation({
    "expectation_type": "expect_column_values_to_be_between",
    "kwargs": {"column": "age", "min_value": 18, "max_value": 70}
})
suite.add_expectation({
    "expectation_type": "expect_column_values_to_match_regex",
    "kwargs": {"column": "email", "regex": r"[^@]+@[^@]+\.[^@]+"}
})

context.save_expectation_suite(suite, expectation_suite_name)

# 6. Create Validator and validate the DataFrame
validator = context.get_validator(
    datasource_name=None,
    data_asset_name="sample_data_asset",
    batch_data=df,
    expectation_suite_name=expectation_suite_name,
)

validation_result = validator.validate()

print("\nGreat Expectations Validation Results:")
print(f"Success: {validation_result['success']}")
for result in validation_result["results"]:
    exp_type = result["expectation_config"]["expectation_type"]
    column = result["expectation_config"]["kwargs"].get("column", "")
    success = result["success"]
    print(f" - {exp_type} on column '{column}': {'Passed' if success else 'Failed'}")




Could not display profile inline due to: 'float' object has no attribute 'ndim'



Profile saved to data_profile.html





AttributeError: 'EphemeralDataContext' object has no attribute 'get_expectation_suite'
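This second `AttributeError` suggests that the installed Great Expectations is a 1.x release, where suites are managed through `context.suites` and an in-memory DataFrame is validated through a batch definition rather than `get_validator`. The following is a minimal sketch under that assumption. It has not been run in this notebook's environment. The function names (`format_result`, `validate_dataframe`, `print_summary`) and the data source, asset, and batch definition names are illustrative.

```python
def format_result(expectation_type, column, success):
    """Render one validation result in the same style as the cells above."""
    status = "Passed" if success else "Failed"
    return f" - {expectation_type} on column '{column}': {status}"


def validate_dataframe(df, suite_name="manual_suite"):
    """Validate a pandas DataFrame against a hand-written suite.

    Assumes Great Expectations 1.x. The import is deferred so that the
    formatting helper stays usable without the library installed.
    """
    import great_expectations as gx

    context = gx.get_context()

    # Register the DataFrame as a batch (all names here are illustrative).
    data_source = context.data_sources.add_pandas("pandas_source")
    asset = data_source.add_dataframe_asset(name="sample_data_asset")
    batch_definition = asset.add_batch_definition_whole_dataframe("whole_df")
    batch = batch_definition.get_batch(batch_parameters={"dataframe": df})

    # The same four expectations as in the manual suite above.
    suite = context.suites.add(gx.ExpectationSuite(name=suite_name))
    suite.add_expectation(gx.expectations.ExpectColumnValuesToNotBeNull(column="age"))
    suite.add_expectation(gx.expectations.ExpectColumnValuesToBeUnique(column="email"))
    suite.add_expectation(
        gx.expectations.ExpectColumnValuesToBeBetween(
            column="age", min_value=18, max_value=70
        )
    )
    suite.add_expectation(
        gx.expectations.ExpectColumnValuesToMatchRegex(
            column="email", regex=r"[^@]+@[^@]+\.[^@]+"
        )
    )
    return batch.validate(suite)


def print_summary(validation_result):
    """Print overall success and one line per expectation."""
    print(f"Success: {validation_result.success}")
    for result in validation_result.results:
        config = result.expectation_config
        # In 1.x the expectation name is assumed to be exposed as `config.type`.
        print(format_result(config.type, config.kwargs.get("column", ""), result.success))
```

With the `df` built in the cell above, `print_summary(validate_dataframe(df))` would run the four checks and print one line per expectation.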

### Task 2: Real-time Monitoring of Data Quality

**Steps**:
1. Setting up Alerts for Quality Drops
    - Use the logging library to set up a basic alert on failed expectations.
    - Implement alerts using email notifications.
    - Use a dashboard such as Grafana for visual alerts.
        - Note: This example assumes integration with a monitoring system.
        - Alert setup would involve creating a data source and an alert rule in Grafana.

In [None]:
# Write your code from here
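One way to approach the first two steps with only the standard library is sketched below. It assumes validation results have already been reduced to plain dictionaries with `expectation`, `column`, and `success` keys (a hypothetical shape, not a Great Expectations type). The function names, logger name, and email addresses are placeholders. The email is only constructed here; sending it would require an SMTP server, for example through `smtplib.SMTP(...).send_message(msg)`.

```python
import logging
from email.message import EmailMessage

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger("data_quality")


def alert_on_failures(results):
    """Log one ERROR per failed check and return the list of failures."""
    failures = [r for r in results if not r["success"]]
    for r in failures:
        logger.error("Expectation failed: %s on column '%s'", r["expectation"], r["column"])
    if not failures:
        logger.info("All %d expectations passed", len(results))
    return failures


def build_alert_email(failures, sender, recipient):
    """Build (but do not send) an email summarizing the failed checks."""
    msg = EmailMessage()
    msg["Subject"] = f"Data quality alert: {len(failures)} failed expectation(s)"
    msg["From"] = sender
    msg["To"] = recipient
    lines = [f"- {r['expectation']} on column '{r['column']}'" for r in failures]
    msg.set_content("The following checks failed:\n" + "\n".join(lines))
    return msg


# Hypothetical results, for illustration only.
sample_results = [
    {"expectation": "expect_column_values_to_not_be_null", "column": "age", "success": True},
    {"expectation": "expect_column_values_to_be_unique", "column": "email", "success": False},
]
failed = alert_on_failures(sample_results)
if failed:
    email = build_alert_email(failed, "alerts@example.com", "data-team@example.com")
    print(email["Subject"])
```

Keeping the alert logic separate from the transport makes it straightforward to swap the email for another channel later, such as a metric that a Grafana alert rule watches.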

### Task 3: Using AI for Data Quality Monitoring

**Steps**:
1. Basic AI Models for Monitoring
    - Train a simple anomaly detection model using Isolation Forest.
    - Use simple custom function-based logic for outlier detection.
    - Create a monitoring function that uses a pre-trained machine learning model.
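
The three steps can be sketched as follows. The IQR rule and the `monitor` function use only the standard library, while `train_isolation_forest` assumes scikit-learn is installed and imports it lazily. All function names, thresholds, and sample values are illustrative. `monitor` relies on the scikit-learn convention that `predict` returns `-1` for anomalies and `1` for inliers, so any fitted model that follows this convention can be passed in.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


def train_isolation_forest(values, contamination=0.05, random_state=42):
    """Fit an Isolation Forest on one numeric column (requires scikit-learn)."""
    from sklearn.ensemble import IsolationForest

    model = IsolationForest(contamination=contamination, random_state=random_state)
    model.fit([[v] for v in values])  # scikit-learn expects a 2-D input
    return model


def monitor(model, values, max_anomaly_rate=0.1):
    """Score a new batch with a pre-trained model and flag quality drops."""
    predictions = model.predict([[v] for v in values])
    anomalies = [v for v, p in zip(values, predictions) if p == -1]
    rate = len(anomalies) / len(values) if values else 0.0
    return {"anomaly_rate": rate, "anomalies": anomalies, "alert": rate > max_anomaly_rate}


# Made-up income values with one obvious outlier.
incomes = [48000, 52000, 50500, 49500, 51000, 47000, 53000, 250000]
print(iqr_outliers(incomes))  # [250000]
```

A typical flow would be to fit the model once on trusted historical data with `train_isolation_forest`, then call `monitor(model, new_values)` on each incoming batch and raise an alert when the returned `alert` flag is true.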