### Task 1: Automated Data Profiling

**Steps**:
1. Using Pandas-Profiling (now distributed as ydata-profiling)
    - Generate a profile report for an existing CSV file.
    - Customize the profile report to include correlations.
    - Profile a specific subset of columns.
2. Using Great Expectations
    - Create a basic expectation suite for your data.
    - Validate data against an expectation suite.
    - Add multiple expectations to a suite.

In [2]:
# Write your code from here
!pip install ydata-profiling


Defaulting to user installation because normal site-packages is not writeable

[notice] A new release of pip is available: 23.0.1 -> 25.1.1
[notice] To update, run: pip install --upgrade pip


In [6]:
import pandas as pd
from ydata_profiling import ProfileReport

# Load data
df = pd.read_csv("/workspaces/AI_DATA_ANALYSIS_/src/Module 8/Automating Data Quality Measurement/data.csv")

# Generate profile report
profile = ProfileReport(df, title="Full Data Profiling Report", explorative=True)
profile.to_file("full_profile_report.html")


Summarize dataset:  30%|███       | 3/10 [00:00<00:00, 31.20it/s, Describe variable: Registered]


AttributeError: 'float' object has no attribute 'ndim'

In [7]:
from ydata_profiling import ProfileReport

profile = ProfileReport(
    df,
    correlations={
        # Each correlation takes a settings dict; a bare bool raises
        # "TypeError: argument of type 'bool' is not iterable"
        "pearson": {"calculate": True},
        "spearman": {"calculate": True},
        "kendall": {"calculate": True},
        "phi_k": {"calculate": False},  # skip the phi_k correlation
        "cramers": {"calculate": False},  # skip the chi-square based Cramér's V for categoricals
    },
    vars={"cat": {"chi_squared_threshold": 0.0}},  # disable chi-square test on categorical vars
    explorative=True
)
profile.to_file("profile_report_no_chisq.html")

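
The steps list asks for a profile of a specific subset of columns, which none of the cells above attempts. The following is a minimal sketch of one way to do it. It assumes the column names Name, Age and Score that appear in the later outputs. The sample rows and the helpers `select_columns` and `profile_subset` are made up for illustration, and `minimal=True` is chosen only to keep the report light.

```python
import pandas as pd


def select_columns(df, columns):
    """Return only the requested columns, failing loudly on unknown names."""
    missing = [c for c in columns if c not in df.columns]
    if missing:
        raise KeyError(f"Columns not found: {missing}")
    return df[list(columns)]


def profile_subset(df, columns, output_path):
    """Write a profile report covering only `columns` (hypothetical helper)."""
    # Imported lazily so select_columns stays usable without ydata-profiling installed
    from ydata_profiling import ProfileReport

    subset = select_columns(df, columns)
    report = ProfileReport(subset, title=f"Profile of {', '.join(columns)}", minimal=True)
    report.to_file(output_path)
    return report


# Made-up sample rows; in the notebook this would be the df read from data.csv
sample = pd.DataFrame(
    {"Name": ["Alice", "Bob"], "Age": [30.0, 25.0], "Score": [85.0, 90.0]}
)
subset = select_columns(sample, ["Age", "Score"])
```

Calling `profile_subset(df, ["Age", "Score"], "subset_profile.html")` would then profile just those two columns, provided ydata-profiling itself runs in the environment.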

In [5]:
!pip install --upgrade ydata-profiling scipy numpy pandas


Defaulting to user installation because normal site-packages is not writeable
Collecting numpy
  Downloading numpy-2.2.6-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (16.8 MB)
Collecting pandas
  Using cached pandas-2.2.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (13.1 MB)
Collecting numpy
  Downloading numpy-2.1.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (16.3 MB)
Installing collected packages: numpy, pandas
  Attempting uninstall: numpy
    Found existing installation: numpy 1.26.4
    Uninstalling numpy-1.26.4:
      Successfully uninstalled numpy-1.26.4
  Attempting uninstall: pandas
    Found existing installation: pandas 2.1.4
    Uninstalling pandas-2.1.4:
 

### Task 2: Real-time Monitoring of Data Quality

**Steps**:
1. Setting up Alerts for Quality Drops
    - Use the logging library to set up a basic alert on failed expectations.
    - Implement alerts using email notifications.
    - Use a dashboard such as Grafana for visual alerts.
        - Note: Example assumes integration with a monitoring system
        - Alert setup would involve creating a data source and alert rule in Grafana

In [8]:
# Write your code from here
import logging

import great_expectations as gx

# Configure logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

# Initialize the Data Context. This cell assumes Great Expectations 1.x, where
# DataContext can no longer be imported and gx.get_context() is used instead.
context = gx.get_context()

# Register the DataFrame loaded earlier (df) as a batch; the names are placeholders
data_source = context.data_sources.add_pandas("your_datasource")
data_asset = data_source.add_dataframe_asset(name="your_data_asset")
batch_definition = data_asset.add_batch_definition_whole_dataframe("your_batch_definition")
batch = batch_definition.get_batch(batch_parameters={"dataframe": df})

# Build a suite with multiple expectations (columns and bounds are illustrative; customize for your data)
suite = context.suites.add(gx.ExpectationSuite(name="your_suite"))
suite.add_expectation(gx.expectations.ExpectColumnValuesToNotBeNull(column="Age"))
suite.add_expectation(gx.expectations.ExpectColumnValuesToBeBetween(column="Score", min_value=0, max_value=100))

# Run validation
result = batch.validate(suite)

# Check the validation result and log failures
if not result.success:
    logger.error("Data quality validation failed!")
    for evr in result.results:
        if not evr.success:
            logger.error(f"Failed expectation: {evr.expectation_config.type} on {evr.expectation_config.kwargs}")
else:
    logger.info("Data quality validation passed successfully.")

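
The email-notification step from the list above can be sketched with the standard library alone. This is a hedged example, not a production setup. The addresses, the SMTP host and port, and the function names `build_alert_email` and `send_alert_email` are all hypothetical, and a real mail server will usually also need TLS and authentication.

```python
import smtplib
from email.message import EmailMessage


def build_alert_email(failed_checks, sender, recipient):
    """Compose an alert message listing the failed expectations."""
    msg = EmailMessage()
    msg["Subject"] = f"Data quality alert: {len(failed_checks)} failed expectation(s)"
    msg["From"] = sender
    msg["To"] = recipient
    body = "The following expectations failed:\n"
    body += "\n".join(f"- {check}" for check in failed_checks)
    msg.set_content(body)
    return msg


def send_alert_email(msg, host="localhost", port=25):
    """Send the message through an SMTP server (host and port are placeholders)."""
    with smtplib.SMTP(host, port) as server:
        server.send_message(msg)


# Example with made-up failures; nothing is sent until send_alert_email is called
alert = build_alert_email(
    ["expect_column_values_to_not_be_null on Age", "expect_column_values_to_be_between on Score"],
    sender="monitor@example.com",
    recipient="data-team@example.com",
)
```

Keeping message construction separate from delivery makes the alert text easy to check without a mail server, and the same message could be logged as well as sent.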

### Task 3: Using AI for Data Quality Monitoring
**Steps**:
1. Basic AI Models for Monitoring
    - Train a simple anomaly detection model using Isolation Forest.
    - Use simple custom-function-based logic for outlier detection.
    - Create a monitoring function that uses a pre-trained machine learning model.

In [9]:
# Write your code from here
import pandas as pd
from sklearn.ensemble import IsolationForest

# Load your data
df = pd.read_csv("data.csv")

# Select numeric columns for anomaly detection
numeric_cols = df.select_dtypes(include=["float64", "int64"]).columns
X = df[numeric_cols]

# Train Isolation Forest model
model = IsolationForest(contamination=0.05, random_state=42)
model.fit(X)

# Predict anomalies (-1 means anomaly, 1 means normal)
df['anomaly'] = model.predict(X)

# Mark anomalies for review
anomalies = df[df['anomaly'] == -1]
print(f"Number of anomalies detected: {len(anomalies)}")
print(anomalies)


Number of anomalies detected: 1
      Name  Age Gender  Score  Registered  anomaly
2  Charlie  NaN   Male   78.0  2023-01-20       -1
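
The third step, a monitoring function built on a pre-trained model, is not covered by the cell above. One possible shape is sketched below under several assumptions: the model is serialized with `pickle`, the training data is synthetic, and the names `train_and_serialize`, `monitor_batch`, and `max_anomaly_rate` are invented for illustration.

```python
import pickle

import numpy as np
from sklearn.ensemble import IsolationForest


def train_and_serialize(X_train):
    """Fit an Isolation Forest once and return it as bytes (stand-in for a saved file)."""
    model = IsolationForest(random_state=42)
    model.fit(X_train)
    return pickle.dumps(model)


def monitor_batch(model_bytes, X_new, max_anomaly_rate=0.1):
    """Score a new batch with the stored model and decide whether to alert."""
    # Only unpickle bytes from a trusted source
    model = pickle.loads(model_bytes)
    labels = model.predict(X_new)  # -1 means anomaly, 1 means normal
    rate = float(np.mean(labels == -1))
    return {"anomaly_rate": rate, "alert": rate > max_anomaly_rate}


# Synthetic reference data: 500 rows, 2 features, standard normal
rng = np.random.default_rng(0)
X_train = rng.normal(loc=0.0, scale=1.0, size=(500, 2))
model_bytes = train_and_serialize(X_train)

# A batch far outside the training range should trip the alert
report = monitor_batch(model_bytes, np.full((20, 2), 100.0))
print(report)
```

Training once and reusing the stored model keeps the anomaly threshold fixed between runs, whereas refitting on every batch would let drifting data redefine what counts as normal.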


In [10]:
from scipy.stats import zscore

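def detect_outliers_zscore(df, threshold=3):
    # Note: df already contains the 'anomaly' column added above, so it is scored as well
    numeric_cols = df.select_dtypes(include=["float64", "int64"]).columns
    # zscore propagates NaN by default, so a column with missing values (such as Age)
    # yields all-NaN scores and none of its cells is flagged
    z_scores = df[numeric_cols].apply(zscore)
    outliers = (z_scores.abs() > threshold)
    return outliers

outlier_flags = detect_outliers_zscore(df)
print(outlier_flags.sum())  # count of outlier cells per column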


Age        0
Score      0
anomaly    0
dtype: int64
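
All-zero counts are expected on a small sample. With the population standard deviation that `zscore` uses by default, no value among n rows can have |z| above sqrt(n - 1), so a threshold of 3 cannot fire on 10 rows or fewer. If data.csv is that small (an assumption, since its size is not shown here), a median-based rule may be more useful. The sketch below uses the modified z-score with the conventional 3.5 cutoff. The function names and the sample scores are made up for illustration.

```python
import math
import statistics


def modified_zscores(values):
    """Return 0.6745 * (x - median) / MAD per value; NaN inputs stay NaN."""
    clean = [v for v in values if not math.isnan(v)]
    med = statistics.median(clean)
    mad = statistics.median(abs(v - med) for v in clean)
    scores = []
    for v in values:
        if math.isnan(v):
            scores.append(float("nan"))
        elif mad == 0:
            # More than half the values are identical; no scale to compare against
            scores.append(0.0)
        else:
            scores.append(0.6745 * (v - med) / mad)
    return scores


def detect_outliers_mad(values, threshold=3.5):
    """Flag values whose modified z-score exceeds the threshold in absolute value."""
    return [abs(s) > threshold for s in modified_zscores(values)]


# Made-up scores: five rows, one obviously wrong entry
scores = [85.0, 90.0, 78.0, 88.0, 300.0]
flags = detect_outliers_mad(scores)
print(flags)
```

On these five values the ordinary z-score of 300.0 is at most 2, so the threshold of 3 misses it, while the median-based score flags it because the median and MAD are barely moved by the single extreme entry.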
