## Architecture to Monitor Data Quality Over Time

**Description**: Design a monitoring system in Python that checks and logs data quality metrics (accuracy, completeness) for a dataset over time.

**Steps to follow:**
1. Implement a Scheduled Script:
    - Use the `schedule` library to run a script periodically.
2. Script to Calculate Metrics:
    - For simplicity, use a function `calculate_quality_metrics()` that calculates and logs metrics such as the missing rate (completeness) or the mismatch rate between two columns (accuracy).
3. Store Logs:
    - Use Python's `logging` library to save these metrics over time.
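
Before wiring in the scheduler, it can help to check the metric definitions on a small in-memory DataFrame. The sketch below is only illustrative: the function name `quality_metrics`, the logger name `dq_sketch`, and the sample data are hypothetical, and treating rows where both columns are missing as "not a mismatch" is an assumption rather than something the exercise specifies.

```python
import logging

import pandas as pd

logger = logging.getLogger("dq_sketch")  # hypothetical logger name


def quality_metrics(df: pd.DataFrame) -> dict:
    """Return missing and mismatch rates (in percent) for df."""
    # Completeness: mean of the per-column missing fractions
    metrics = {"missing_rate": float(df.isna().mean().mean() * 100)}

    # Accuracy example: share of rows where the two columns disagree.
    # Assumption: rows where both values are missing are not mismatches
    # (a plain != comparison would count them, since NaN != NaN is True).
    if {"value_1", "value_2"}.issubset(df.columns):
        both_missing = df["value_1"].isna() & df["value_2"].isna()
        differs = (df["value_1"] != df["value_2"]) & ~both_missing
        metrics["mismatch_rate"] = float(differs.mean() * 100)
    return metrics


# Hypothetical sample data for illustration only
sample = pd.DataFrame(
    {
        "value_1": [1.0, 2.0, None, 4.0],
        "value_2": [1.0, 5.0, None, None],
    }
)
metrics = quality_metrics(sample)
logger.info("metrics: %s", metrics)
```

Returning the metrics as a dictionary, rather than only logging them, makes the calculation easy to test separately from the scheduling and logging steps.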

In [1]:
# Write your code from here
import pandas as pd
import schedule
import time
import logging
from datetime import datetime

# Setup logging
logging.basicConfig(
    filename="data_quality.log",
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s"
)

def calculate_quality_metrics(file_path):
    """
    Load dataset and calculate data quality metrics:
    - Missing data rate (completeness)
    - Example mismatch rate (accuracy) between two columns if applicable
    """

    try:
        df = pd.read_csv(file_path)

        # Completeness: overall missing rate (%)
        missing_rate = df.isnull().mean().mean() * 100

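        # Accuracy example: mismatch rate between two columns, if both exist.
        # Rows where both values are missing are not counted as mismatches,
        # because NaN != NaN evaluates to True in pandas.
        if "value_1" in df.columns and "value_2" in df.columns:
            both_missing = df["value_1"].isna() & df["value_2"].isna()
            differs = (df["value_1"] != df["value_2"]) & ~both_missing
            mismatch_rate = differs.mean() * 100
        else:
            mismatch_rate = None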

        # Log the metrics
        log_msg = f"Missing Rate: {missing_rate:.2f}%"
        if mismatch_rate is not None:
            log_msg += f", Mismatch Rate: {mismatch_rate:.2f}%"

        logging.info(log_msg)
        print(log_msg)  # Optional: print to console

    except Exception as e:
        # logging.exception records the traceback along with the message
        logging.exception("Error calculating metrics for %s", file_path)
        print(f"Error: {e}")

def job():
    print(f"Running data quality check at {datetime.now()}")
    calculate_quality_metrics("your_dataset.csv")

# Schedule the job every day at 10:00 AM
schedule.every().day.at("10:00").do(job)

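print("Data quality monitoring started...")
try:
    while True:
        schedule.run_pending()
        time.sleep(30)  # Check for due jobs every 30 seconds
except KeyboardInterrupt:
    # Interrupting the kernel (or pressing Ctrl+C) stops the loop cleanly
    print("Data quality monitoring stopped.")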



Data quality monitoring started...


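
Since the goal is to follow quality over time, the logged lines can be read back into structured records. The following is a minimal sketch that assumes the log format configured above (`%(asctime)s - %(levelname)s - %(message)s` with the default timestamp style) and the message text produced by `calculate_quality_metrics()`. The function name `parse_quality_log` and the sample lines are hypothetical, not real log output.

```python
import re
from datetime import datetime

# Matches INFO lines such as:
# 2024-01-01 10:00:00,123 - INFO - Missing Rate: 2.50%, Mismatch Rate: 10.00%
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3} - INFO - "
    r"Missing Rate: (?P<missing>\d+(?:\.\d+)?)%"
    r"(?:, Mismatch Rate: (?P<mismatch>\d+(?:\.\d+)?)%)?$"
)


def parse_quality_log(lines):
    """Turn metric log lines into a list of dicts; other lines are skipped."""
    records = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # e.g. ERROR lines or unrelated messages
        mismatch = match.group("mismatch")
        records.append(
            {
                "timestamp": datetime.strptime(
                    match.group("ts"), "%Y-%m-%d %H:%M:%S"
                ),
                "missing_rate": float(match.group("missing")),
                "mismatch_rate": float(mismatch) if mismatch is not None else None,
            }
        )
    return records


# Hypothetical sample lines for illustration only
sample_lines = [
    "2024-01-01 10:00:00,123 - INFO - Missing Rate: 2.50%, Mismatch Rate: 10.00%",
    "2024-01-02 10:00:01,456 - INFO - Missing Rate: 3.75%",
    "2024-01-02 10:00:02,000 - ERROR - Error calculating metrics: file not found",
]
records = parse_quality_log(sample_lines)
```

To read the actual log, the open file can be passed directly, for example `with open("data_quality.log") as f: records = parse_quality_log(f)`.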