### Task 1: Automated Data Profiling

**Steps**:
1. Using Pandas-Profiling
    - Generate a profile report for an existing CSV file.
    - Customize the profile report to include correlations.
    - Profile a specific subset of columns.
2. Using Great Expectations
    - Create a basic expectation suite for your data.
    - Validate data against an expectation suite.
    - Add multiple expectations to a suite.
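
To make the Great Expectations steps concrete, the sketch below expresses three checks as plain pandas conditions and groups them into a suite-like list. It is not the Great Expectations API: the `Check` tuple, the `run_suite` helper, and the `id`, `age`, and `income` columns are hypothetical and chosen only for illustration.

```python
from typing import Callable, NamedTuple

import pandas as pd


class Check(NamedTuple):
    """A named rule that returns True when the DataFrame passes (hypothetical helper)."""
    name: str
    rule: Callable[[pd.DataFrame], bool]


# A suite is just an ordered list of checks; add more by appending to it.
suite = [
    Check("age is never null", lambda d: bool(d["age"].notna().all())),
    Check("income is non-negative", lambda d: bool((d["income"].dropna() >= 0).all())),
    Check("id is unique", lambda d: bool(d["id"].is_unique)),
]


def run_suite(data: pd.DataFrame, checks: list) -> dict:
    """Evaluate every check and report overall success plus per-check outcomes."""
    outcomes = {check.name: check.rule(data) for check in checks}
    return {"success": all(outcomes.values()), "results": outcomes}


# Illustrative data with one missing age and one negative income.
sample = pd.DataFrame(
    {"id": [1, 2, 3], "age": [34, None, 51], "income": [52000.0, 48000.0, -10.0]}
)
report = run_suite(sample, suite)
print(report)
```

Keeping each rule as a named entry makes it easy to see which check failed, which is the same idea an expectation suite provides in a more complete form.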

In [5]:
# Install packages if needed (pandas-profiling is now published as ydata-profiling)
# !pip install pandas-profiling great_expectations scikit-learn

import pandas as pd
from pandas_profiling import ProfileReport
import great_expectations as ge

# Load your CSV file (replace 'data.csv' with your actual CSV file path)
df = pd.read_csv('data.csv')

# 1. Pandas-Profiling: Generate full profile with correlations
profile = ProfileReport(df, title="Pandas Profiling Report", correlations={"pearson": {"calculate": True}})
profile.to_notebook_iframe()  # Display inline in Jupyter

# 2. Profile a specific subset of columns
subset_cols = ['age', 'income']  # example column subset, update as per your data
profile_subset = ProfileReport(df[subset_cols], title="Subset Profile Report")
profile_subset.to_notebook_iframe()

# 3. Great Expectations: wrap the pandas DataFrame (legacy from_pandas API; newer releases may not provide it)
ge_df = ge.from_pandas(df)

# 4. Create an expectation suite and add multiple expectations
ge_df.expect_column_values_to_not_be_null('age')
ge_df.expect_column_values_to_be_between('income', min_value=0)
ge_df.expect_column_values_to_be_unique('id')  # assuming there's an 'id' column

# 5. Validate the dataframe with all expectations
validation_result = ge_df.validate()

print(validation_result)



AttributeError: 'float' object has no attribute 'ndim'
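
When the report generator fails, as in the output above, a rough per-column summary can still be produced with pandas alone. The sketch below is a fallback rather than a substitute for the full report; the `quick_profile` helper and the sample columns are hypothetical.

```python
import pandas as pd


def quick_profile(data: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, missing count, missing share, distinct count."""
    return pd.DataFrame(
        {
            "dtype": data.dtypes.astype(str),
            "missing": data.isna().sum(),
            "missing_pct": (data.isna().mean() * 100).round(1),
            "distinct": data.nunique(),
        }
    )


# Illustrative data; replace with the DataFrame loaded from your CSV.
sample = pd.DataFrame(
    {
        "age": [34, 45, None, 51],
        "income": [52000.0, 61000.0, 48000.0, None],
        "city": ["Pune", "Delhi", "Pune", "Mumbai"],
    }
)

summary = quick_profile(sample)
print(summary)

# Pearson correlations for a chosen subset of numeric columns.
correlations = sample[["age", "income"]].corr(method="pearson")
print(correlations)
```

Selecting columns before calling `corr` mirrors the "subset of columns" step: only the listed columns are summarized.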

### Task 2: Real-time Monitoring of Data Quality

**Steps**:
1. Setting up Alerts for Quality Drops
    - Use the logging library to set up a basic alert on failed expectations.
    - Implement alerts using email notifications.
    - Use a dashboard such as Grafana for visual alerts.
        - Note: Example assumes integration with a monitoring system
        - Alert setup would involve creating a data source and alert rule in Grafana
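
The first two steps can be sketched without a mail server: log a warning for each failed check, then compose the alert message so it can be inspected before anything is sent. The shape of `validation_result` below is an assumption made for this sketch, not a guaranteed Great Expectations schema, and `failed_checks` and `build_alert_email` are hypothetical helpers.

```python
import logging
from email.message import EmailMessage

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("data_quality_monitor")

# Stand-in for a validation result. The field names are an assumption
# for this sketch, not a guaranteed Great Expectations schema.
validation_result = {
    "success": False,
    "results": [
        {"check": "age is never null", "success": True},
        {"check": "income is non-negative", "success": False},
    ],
}


def failed_checks(result: dict) -> list:
    """Log a warning for each failed check and return their names."""
    failures = [r["check"] for r in result["results"] if not r["success"]]
    for name in failures:
        logger.warning("Data quality check failed: %s", name)
    return failures


def build_alert_email(failures: list, sender: str, recipient: str) -> EmailMessage:
    """Compose (but do not send) an alert email listing the failed checks."""
    msg = EmailMessage()
    msg["Subject"] = f"Data Quality Alert: {len(failures)} check(s) failed"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content("Failed checks:\n" + "\n".join(f"- {name}" for name in failures))
    return msg


failures = failed_checks(validation_result)
if failures:
    alert = build_alert_email(failures, "monitor@example.com", "recipient@example.com")
    print(alert)
```

Separating message construction from delivery keeps the alert text testable; the resulting `EmailMessage` can be passed to `smtplib.SMTP.send_message` once a server is configured.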

In [6]:
import logging
import smtplib
from email.message import EmailMessage

# Setup logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger('data_quality_monitor')

# Simple alert function to send email
def send_email_alert(subject, body, to_email):
    msg = EmailMessage()
    msg.set_content(body)
    msg['Subject'] = subject
    msg['From'] = 'your_email@example.com'  # Change this
    msg['To'] = to_email

    # Configure your SMTP server and credentials (placeholders only; avoid hard-coding real passwords)
    smtp_server = 'smtp.example.com'
    smtp_port = 587
    smtp_user = 'your_email@example.com'
    smtp_password = 'your_password'

    try:
        with smtplib.SMTP(smtp_server, smtp_port) as server:
            server.starttls()
            server.login(smtp_user, smtp_password)
            server.send_message(msg)
        logger.info("Alert email sent successfully!")
    except Exception as e:
        logger.error(f"Failed to send email: {e}")

# Example: Using GE validation to trigger alert
if not validation_result['success']:
    logger.warning("Data quality check failed!")
    send_email_alert(
        subject="Data Quality Alert",
        body="One or more data quality checks have failed. Please investigate.",
        to_email="recipient@example.com"
    )
else:
    logger.info("Data quality check passed.")

# Note: For Grafana integration, push GE validation results to a database or monitoring tool
# and create alert rules on Grafana dashboards (this is system/configuration work, not shown here)


NameError: name 'validation_result' is not defined
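
For the Grafana step, validation outcomes first need to be stored somewhere a dashboard can query. The sketch below appends each outcome to a SQLite table using only the standard library. The table name, its columns, and the `record_validation` helper are hypothetical, and whether Grafana can read this particular store depends on the data source you configure.

```python
import sqlite3
from datetime import datetime, timezone


def record_validation(conn, suite_name, success, failed_count):
    """Append one validation outcome with a UTC timestamp."""
    conn.execute(
        "INSERT INTO validation_runs (run_at, suite, success, failed_count) VALUES (?, ?, ?, ?)",
        (datetime.now(timezone.utc).isoformat(), suite_name, int(success), failed_count),
    )
    conn.commit()


# In-memory database for the example; use a file path or a server database in practice.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS validation_runs (
        run_at TEXT NOT NULL,
        suite TEXT NOT NULL,
        success INTEGER NOT NULL,
        failed_count INTEGER NOT NULL
    )
    """
)

record_validation(conn, "customers", True, 0)
record_validation(conn, "customers", False, 2)

# The kind of query an alert rule could be built on: the share of failed runs.
failure_rate = conn.execute("SELECT AVG(1 - success) FROM validation_runs").fetchone()[0]
print(f"failure rate: {failure_rate:.2f}")
```

Storing one row per run, rather than only the latest status, lets a dashboard plot trends and alert on a rate instead of a single failure.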

### Task 3: Using AI for Data Quality Monitoring
**Steps**:
1. Basic AI Models for Monitoring
    - Train a simple anomaly detection model using Isolation Forest.
    - Write a simple custom function that flags outliers.
    - Create a monitoring function that uses a pre-trained machine learning model.
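
As one possible reading of the "custom function" step, the sketch below flags outliers with the interquartile range (IQR) rule using only the standard library. The `iqr_outliers` name, the multiplier `k=1.5`, and the sample values are illustrative assumptions, not part of the original exercise.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the first or third quartile."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Illustrative values with one obvious outlier.
incomes = [48, 50, 51, 52, 53, 55, 56, 58, 60, 250]
print(iqr_outliers(incomes))
```

A rule like this needs no training and is easy to explain, which makes it a useful baseline to compare against a learned model such as Isolation Forest.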

In [7]:
from sklearn.ensemble import IsolationForest
import numpy as np

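# Prepare data: keep numeric columns only (excluding any 'anomaly' column left by a previous run)
# and fill missing values with each column's median
numeric_cols = df.select_dtypes(include=[np.number]).columns.drop('anomaly', errors='ignore')
df_numeric = df[numeric_cols].fillna(df[numeric_cols].median())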

# Train Isolation Forest for anomaly detection
# Note: contamination=0.1 sets the threshold so that roughly 10% of the training rows are flagged
iso_forest = IsolationForest(contamination=0.1, random_state=42)
iso_forest.fit(df_numeric)

# Predict anomalies: -1 is anomaly, 1 is normal
df['anomaly'] = iso_forest.predict(df_numeric)

# Custom monitoring function: report whether any rows are flagged as anomalies (anomaly == -1)
def monitor_data_quality(dataframe):
    anomalies = dataframe[dataframe['anomaly'] == -1]
    if not anomalies.empty:
        print(f"Warning: {len(anomalies)} anomalies detected!")
        return False
    else:
        print("No anomalies detected, data quality is good.")
        return True

# Use the monitoring function
monitor_data_quality(df)

# You could extend this to save results, send alerts, or trigger pipeline steps




False
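
Because `contamination` fixes the share of training rows that are flagged, a check for "any anomaly" on the same data the model was fitted on will report a failure almost every time. One alternative, sketched below under stated assumptions, is to fit the model once on reference data and then compare the anomaly rate of each new batch to a limit. The `batch_is_healthy` helper, the `max_anomaly_rate` value of 0.2, and the synthetic data are hypothetical.

```python
import numpy as np
from sklearn.ensemble import IsolationForest


def batch_is_healthy(model, batch, max_anomaly_rate=0.2):
    """Score a new batch with an already-fitted model and compare the flagged share to a limit."""
    predictions = model.predict(batch)  # -1 marks an anomaly, 1 a normal row
    anomaly_rate = float(np.mean(predictions == -1))
    return anomaly_rate <= max_anomaly_rate, anomaly_rate


rng = np.random.default_rng(42)

# Fit once on reference data that is believed to be clean (synthetic here).
reference = rng.normal(loc=0.0, scale=1.0, size=(500, 2))
model = IsolationForest(contamination=0.05, random_state=42).fit(reference)

# A later batch from the same distribution, and one whose values have drifted.
clean_batch = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
drifted_batch = rng.normal(loc=6.0, scale=1.0, size=(200, 2))

print(batch_is_healthy(model, clean_batch))
print(batch_is_healthy(model, drifted_batch))
```

In a real pipeline the fitted model would typically be saved and reloaded between runs, and the limit tuned against the anomaly rates observed on known-good batches.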