## Architecture to Monitor Data Quality Over Time

**Description**: Design a monitoring system in Python that checks and logs data quality metrics (accuracy, completeness) for a dataset over time.

**Steps to follow:**
1. Implement a Scheduled Script:
    - Use the `schedule` library to run a script periodically.
2. Script to Calculate Metrics:
    - For simplicity, use a function `calculate_quality_metrics()` that calculates and logs metrics such as the missing rate or mismatch rate.
3. Store Logs:
    - Use Python's `logging` library to save these metrics over time.
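The two metrics named in step 2 can be sketched without any third-party dependency. This is only an illustration under stated assumptions: the helper names `is_missing`, `missing_rate`, and `mismatch_rate`, the `YYYY-MM-DD` date rule, and the sample column are hypothetical and are not part of the exercise.

```python
import math
import re

# Assumed validity rule for dates (hypothetical): YYYY-MM-DD
DATE_PATTERN = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def missing_rate(values):
    """Percentage of entries that are missing."""
    if not values:
        return 0.0
    return 100 * sum(is_missing(v) for v in values) / len(values)


def mismatch_rate(values, is_valid):
    """Percentage of non-missing entries that fail the validity check."""
    present = [v for v in values if not is_missing(v)]
    if not present:
        return 0.0
    return 100 * sum(not is_valid(v) for v in present) / len(present)


# Hypothetical sample column: one missing value, one badly formatted date
dates = ["2024-09-01", None, "09/03/2024", "2024-09-04"]
print(f"Missing rate: {missing_rate(dates):.2f}%")
print(f"Mismatch rate: {mismatch_rate(dates, lambda v: bool(DATE_PATTERN.match(v))):.2f}%")
```

Computing the mismatch rate over non-missing entries only keeps the two metrics independent, so a missing value is not counted twice.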

In [1]:
# Write your code from here
! pip install schedule



In [None]:
import pandas as pd
import numpy as np
import logging
import schedule
import time
from datetime import datetime

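# Step 1: Setup Logging
# force=True (Python 3.8+) replaces any handlers already attached to the root
# logger; without it, basicConfig() does nothing if logging was configured
# earlier, which can happen in notebook environments.
logging.basicConfig(
    filename="data_quality_monitor.log",
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s",
    force=True,
)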

# Step 2: Simulated Dataset Loader
def load_data():
    # Simulate a dataset with missing values (for demonstration purposes).
    # Only the random amounts change between runs; the missing pattern is fixed.
    data = {
        "transaction_id": [f"T{i}" if i % 7 != 0 else np.nan for i in range(1, 21)],
        "amount": [np.random.uniform(10, 500) if i % 8 != 0 else np.nan for i in range(1, 21)],
        "date": [f"2024-09-{(i % 30) + 1:02d}" if i % 6 != 0 else np.nan for i in range(1, 21)],
        "price_verified": [i % 4 != 0 for i in range(1, 21)],
    }
    return pd.DataFrame(data)

# Step 3: Quality Metrics Calculator
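def calculate_quality_metrics():
    """Compute completeness and accuracy percentages and write them to the log."""
    df = load_data()

    total_rows = len(df)
    if total_rows == 0:
        # Avoid dividing by zero when the dataset is empty.
        logging.warning("Dataset is empty; skipping quality metrics.")
        return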
    missing_transaction_id = df['transaction_id'].isnull().sum()
    missing_amount = df['amount'].isnull().sum()
    missing_date = df['date'].isnull().sum()

    completeness_transaction_id = 100 * (1 - missing_transaction_id / total_rows)
    completeness_amount = 100 * (1 - missing_amount / total_rows)
    completeness_date = 100 * (1 - missing_date / total_rows)

    # Share of rows flagged as verified, used here as a proxy for accuracy
    accuracy_verified = df['price_verified'].mean() * 100

    log_message = (
        f"Completeness - Transaction ID: {completeness_transaction_id:.2f}%, "
        f"Amount: {completeness_amount:.2f}%, Date: {completeness_date:.2f}%. "
        f"Accuracy - Price Verified: {accuracy_verified:.2f}%"
    )

    logging.info(log_message)
    print(f"[{datetime.now().strftime('%H:%M:%S')}] ✅ Data quality metrics logged.")

# Step 4: Schedule to Run Every 1 Minute (adjustable)
# The first run happens one minute after the job is scheduled.
schedule.every(1).minutes.do(calculate_quality_metrics)

# Step 5: Run the Scheduler
print("📊 Starting data quality monitor... (press Ctrl+C to stop)")
try:
    while True:
        schedule.run_pending()
        time.sleep(1)
except KeyboardInterrupt:
    print("\n🛑 Stopped monitoring.")


📊 Starting data quality monitor... (press Ctrl+C to stop)
[16:16:07] ✅ Data quality metrics logged.
[16:17:07] ✅ Data quality metrics logged.
[16:18:07] ✅ Data quality metrics logged.
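Because the metrics are stored as formatted log lines, tracking them over time means reading those lines back. The following is a rough sketch of one way to do that, assuming the default `asctime` layout (`YYYY-MM-DD HH:MM:SS,mmm`) and the message layout built in `calculate_quality_metrics()`. The names `LINE_PATTERN`, `parse_log_line`, and `read_metrics` are hypothetical, and the timestamp in the sample line is made up; the percentages are the ones `load_data()` produces as written.

```python
import re
from datetime import datetime

# Assumes the log format "%(asctime)s - %(levelname)s - %(message)s"
# with the default asctime layout and the message built by the monitor.
LINE_PATTERN = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3} - INFO - "
    r"Completeness - Transaction ID: (?P<transaction_id>[\d.]+)%, "
    r"Amount: (?P<amount>[\d.]+)%, Date: (?P<date>[\d.]+)%\. "
    r"Accuracy - Price Verified: (?P<price_verified>[\d.]+)%$"
)


def parse_log_line(line):
    """Return a dict with the timestamp and metric values, or None if the line does not match."""
    match = LINE_PATTERN.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    record = {"timestamp": datetime.strptime(fields.pop("timestamp"), "%Y-%m-%d %H:%M:%S")}
    record.update({name: float(value) for name, value in fields.items()})
    return record


def read_metrics(path="data_quality_monitor.log"):
    """Parse every matching line of the log file into a list of records."""
    with open(path, encoding="utf-8") as handle:
        return [record for record in map(parse_log_line, handle) if record is not None]


# Illustrative line: the timestamp is invented for this example.
sample = (
    "2024-09-01 16:16:07,123 - INFO - Completeness - Transaction ID: 90.00%, "
    "Amount: 90.00%, Date: 85.00%. Accuracy - Price Verified: 75.00%"
)
print(parse_log_line(sample))
```

The list returned by `read_metrics()` could then be passed to `pd.DataFrame` to plot or compare the metrics across runs. Writing the metrics as structured data (for example JSON or CSV) would avoid the regular expression, at the cost of changing the log format used above.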
