### Task 1: Automated Data Profiling

**Steps**:
1. Using pandas-profiling (since renamed ydata-profiling)
    - Generate a profile report for an existing CSV file.
    - Customize the profile report to include correlations.
    - Profile a specific subset of columns.
2. Using Great Expectations
    - Create a basic expectation suite for your data.
    - Validate data against an expectation suite.
    - Add multiple expectations to a suite.
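
As a rough illustration of what a profile report summarizes, the sketch below computes a few per-column statistics and one Pearson correlation using only the standard library. The helper names `summarize` and `pearson` are made up for this example, and this is not how the profiling library itself is implemented.

```python
import statistics

# Same sample values as the numeric columns of the dataset used in this task.
columns = {
    "age": [25, 30, 22, 40, 28],
    "income": [50000, 60000, 52000, 80000, 58000],
    "score": [0.8, 0.6, 0.75, 0.9, 0.65],
}


def summarize(values):
    """Return a few per-column statistics of the kind a profile lists."""
    present = [v for v in values if v is not None]
    return {
        "count": len(values),
        "missing": len(values) - len(present),
        "min": min(present),
        "max": max(present),
        "mean": statistics.mean(present),
    }


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


summary = {name: summarize(values) for name, values in columns.items()}
age_income_corr = pearson(columns["age"], columns["income"])
print(summary["age"])
print(round(age_income_corr, 3))
```

A real profile report adds much more (histograms, type inference, warnings), but the per-column summary and the correlation matrix are built from quantities like these.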

In [1]:
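import pandas as pd
# pandas-profiling has been renamed ydata-profiling (pip install ydata-profiling).
from ydata_profiling import ProfileReport
# If profiling fails with "TypeError: NDFrame.infer_objects() got an unexpected
# keyword argument 'copy'", the installed pandas is probably older than 2.0,
# where that keyword was introduced; upgrading pandas is a likely fix.
import great_expectations as ge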

# Sample dataset creation
data = {
    "age": [25, 30, 22, 40, 28],
    "income": [50000, 60000, 52000, 80000, 58000],
    "gender": ["M", "F", "F", "M", "F"],
    "score": [0.8, 0.6, 0.75, 0.9, 0.65]
}
df = pd.DataFrame(data)

# 1. Profiling
# Full report with Pearson correlations explicitly enabled.
profile = ProfileReport(df, correlations={"pearson": {"calculate": True}}, minimal=False)
profile.to_file("full_report.html")

# Minimal report restricted to a subset of columns.
subset_profile = ProfileReport(df[["age", "income"]], minimal=True)
subset_profile.to_file("subset_report.html")

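# 2. Great Expectations
# Note: this cell uses the legacy Dataset API; newer Great Expectations
# releases no longer provide ge.from_pandas.
context = ge.get_context()
suite_name = "basic_suite"
# Reuse the suite if it already exists in the context, otherwise create it.
try:
    suite = context.get_expectation_suite(suite_name)
except Exception:
    suite = context.create_expectation_suite(suite_name, overwrite_existing=True)

# Wrap the DataFrame so expectation methods can be called on it directly.
# Expectations declared on `batch` accumulate in the dataset's own suite.
batch = ge.from_pandas(df)

# Add multiple expectations: column presence, value ranges, completeness, allowed values.
batch.expect_column_to_exist("age")
batch.expect_column_values_to_be_between("age", min_value=18, max_value=65)
batch.expect_column_values_to_not_be_null("income")
batch.expect_column_values_to_be_in_set("gender", ["M", "F"])
batch.expect_column_values_to_be_between("score", min_value=0.0, max_value=1.0)

# Validate the data against every expectation declared above.
results = batch.validate()
print(results)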




TypeError: NDFrame.infer_objects() got an unexpected keyword argument 'copy'
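
To make the idea of an expectation suite concrete without depending on a particular Great Expectations version, here is a minimal standard-library sketch of the same pattern: each expectation is a function that returns a success flag plus the offending values, and a suite is a list of such checks. All function names and the result layout are hypothetical and only loosely modeled on the library; the range check skipping missing values is an assumption of this sketch.

```python
def expect_values_between(values, min_value, max_value):
    """Range check that skips missing values (an assumption of this sketch)."""
    present = [v for v in values if v is not None]
    unexpected = [v for v in present if not min_value <= v <= max_value]
    return {"success": not unexpected, "unexpected": unexpected}


def expect_values_not_null(values):
    """Completeness check: succeeds only if no value is None."""
    missing = sum(1 for v in values if v is None)
    return {"success": missing == 0, "unexpected": [None] * missing}


def validate(columns, suite):
    """Run every (column, check, kwargs) entry and aggregate the outcomes."""
    results = []
    for column, check, kwargs in suite:
        outcome = check(columns[column], **kwargs)
        results.append({"column": column, "check": check.__name__, **outcome})
    return {"success": all(r["success"] for r in results), "results": results}


# Hypothetical suite: several expectations collected in one list.
suite = [
    ("age", expect_values_not_null, {}),
    ("age", expect_values_between, {"min_value": 18, "max_value": 65}),
]

# Made-up values with one missing entry and one out-of-range age.
report = validate({"age": [25, None, 17, 45]}, suite)
for result in report["results"]:
    print(result["column"], result["check"], result["success"], result["unexpected"])
print("suite passed:", report["success"])
```

The overall `success` flag is true only when every individual check passes, which is the same property the notebook relies on when it inspects the validation result.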

### Task 2: Real-time Monitoring of Data Quality

**Steps**:
1. Setting up Alerts for Quality Drops
    - Use the logging library to set up a basic alert on failed expectations.
    - Implement alerts using email notifications.
    - Use a dashboard such as Grafana for visual alerts.
        - Note: This example assumes integration with an external monitoring system.
        - Setting up the alert involves creating a data source and an alert rule in Grafana.
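
Grafana reads its numbers from a data source rather than from the notebook, so the validation outcome has to be published somewhere first. The sketch below assumes, purely for illustration, that the data source is Prometheus and renders pass ratios in its text exposition format; the metric name `data_quality_pass_ratio` and the function `render_quality_metrics` are hypothetical, and serving or scraping the text is left out.

```python
def render_quality_metrics(suite_results):
    """Render pass ratios as Prometheus-style gauge lines.

    suite_results maps a suite name to (passed, total) expectation counts.
    The metric name is hypothetical.
    """
    lines = [
        "# HELP data_quality_pass_ratio Share of expectations that passed.",
        "# TYPE data_quality_pass_ratio gauge",
    ]
    for suite_name, (passed, total) in sorted(suite_results.items()):
        # Report 0.0 for an empty suite instead of dividing by zero.
        ratio = passed / total if total else 0.0
        lines.append(f'data_quality_pass_ratio{{suite="{suite_name}"}} {ratio:.3f}')
    return "\n".join(lines) + "\n"


metrics_text = render_quality_metrics({"basic_suite": (4, 5)})
print(metrics_text)
```

Under that assumption, an alert rule in Grafana could fire whenever the ratio for a suite stays below a chosen threshold.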

In [2]:
import logging
import smtplib
from email.message import EmailMessage
import great_expectations as ge
import pandas as pd

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

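def send_email_alert(subject, body, to_email):
    """Send a plain-text alert email through a local SMTP server."""
    msg = EmailMessage()
    msg.set_content(body)
    msg["Subject"] = subject
    msg["From"] = "alert@example.com"
    msg["To"] = to_email

    # Requires an SMTP server listening on localhost (port 25 by default);
    # without one, smtplib raises ConnectionRefusedError.
    with smtplib.SMTP("localhost") as server:
        server.send_message(msg)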

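def run_data_quality_checks(df):
    """Validate df against two expectations on "age" and alert on failure."""
    batch = ge.from_pandas(df)
    batch.expect_column_values_to_not_be_null("age")
    batch.expect_column_values_to_be_between("age", min_value=18, max_value=65)
    results = batch.validate()

    # "success" is True only if every expectation passed.
    if not results["success"]: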
        logger.warning("Data quality check failed!")
        send_email_alert(
            subject="Data Quality Alert",
            body="Data quality checks have failed. Please investigate.",
            to_email="data.team@example.com"
        )
    else:
        logger.info("Data quality check passed.")

# Example usage with sample data
data = {"age": [25, None, 17, 45]}
df = pd.DataFrame(data)
run_data_quality_checks(df)


INFO:great_expectations.data_asset.data_asset:	2 expectation(s) included in expectation_suite.


ConnectionRefusedError: [Errno 111] Connection refused
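
A single failed run is not always a quality drop, so one possible refinement is to alert on the recent pass rate instead. The sketch below keeps a sliding window of outcomes and calls a notifier when the rate falls under a threshold. The class `QualityDropMonitor`, its parameters, and the injected `notify` callable are all hypothetical; passing the notifier in keeps the example runnable without an SMTP server.

```python
import logging
from collections import deque

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("data_quality")


class QualityDropMonitor:
    """Alert when the recent pass rate of quality checks falls below a threshold."""

    def __init__(self, notify, window=5, threshold=0.8):
        self.notify = notify                 # callable that delivers the alert text
        self.results = deque(maxlen=window)  # most recent pass/fail outcomes
        self.threshold = threshold

    def record(self, success):
        """Store one outcome; return True if an alert was sent."""
        self.results.append(bool(success))
        rate = sum(self.results) / len(self.results)
        if rate < self.threshold:
            logger.warning("Pass rate %.2f is below threshold %.2f", rate, self.threshold)
            self.notify(f"Data quality pass rate dropped to {rate:.0%}")
            return True
        logger.info("Pass rate %.2f is within threshold", rate)
        return False


# Collect alerts in a list here; a real notifier could send an email instead.
sent_alerts = []
monitor = QualityDropMonitor(notify=sent_alerts.append, window=4, threshold=0.75)
alert_flags = [monitor.record(outcome) for outcome in (True, True, False, True)]
print(alert_flags)
print(sent_alerts)
```

In the notebook, `notify` could be a small wrapper around `send_email_alert`, and `record` would receive `results["success"]` from each validation run.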

### Task 3: Using AI for Data Quality Monitoring

**Steps**:
1. Basic AI Models for Monitoring
    - Train a simple anomaly detection model using Isolation Forest.
    - Use a simple custom function with rule-based logic for outlier detection.
    - Create a monitoring function that uses a pre-trained machine learning model.

In [3]:
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

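# Build the frame from a dict so both columns get numeric dtypes;
# np.nan marks the missing income.
df = pd.DataFrame({
    "age": [25, 30, 35, 40, 45],
    "income": [50000, 60000, 75000, np.nan, 100000],
})
# Impute the missing income with the median. Assign the result back instead of
# calling fillna(inplace=True) on a column selection.
df["income"] = df["income"].fillna(df["income"].median())

# contamination is the expected share of anomalies in the training data.
model = IsolationForest(contamination=0.1, random_state=42)
model.fit(df)

def detect_anomalies(dataframe, model):
    """Return the rows the model labels as anomalies (-1), leaving the input unchanged."""
    result = dataframe.copy()
    result["anomaly"] = model.predict(dataframe)
    return result[result["anomaly"] == -1]

anomalies_detected = detect_anomalies(df, model)
print(anomalies_detected)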












  age    income  anomaly
4  45  100000.0       -1
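
The cell above covers the Isolation Forest step only. For the custom-function step, one simple option is the interquartile range (IQR) rule, sketched below with the standard library. This is a plain statistical rule rather than a learned model, and the function name `iqr_outliers`, the multiplier `k=1.5`, and the sample incomes are all assumptions chosen for illustration.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return (index, value) pairs lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [(i, v) for i, v in enumerate(values) if v < lower or v > upper]


# Made-up incomes; the last value is deliberately extreme.
incomes = [50000, 60000, 52000, 80000, 58000, 250000]
outliers = iqr_outliers(incomes)
print(outliers)
```

Unlike the Isolation Forest, this rule looks at one column at a time and needs no training, which makes it a useful baseline to compare the model's flags against.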
