## Architecture to Monitor Data Quality Over Time

**Description**: Build a small monitoring system in Python that periodically computes data quality metrics (completeness and accuracy) for a dataset and logs them, so that changes can be tracked over time.

**Steps to follow:**
1. Implement a scheduled script:
    - Use the `schedule` library to run the check at a fixed interval.
2. Calculate the metrics:
    - For simplicity, use a single function, `calculate_quality_metrics()`, that computes the missing rate and the type mismatch rate for each column.
3. Store the logs:
    - Use Python's `logging` module to append each set of metrics, with a timestamp, to a log file.

In [16]:
!pip3 install pandas schedule



In [17]:
import pandas as pd
import schedule
import time
import os
import logging
from IPython.display import display, Markdown

# Ensure logs folder exists
os.makedirs("logs", exist_ok=True)

# Setup logging
log_file = "logs/quality.log"
logging.basicConfig(
    filename=log_file,
    level=logging.INFO,
    format="%(asctime)s | %(levelname)s | %(message)s"
)

# Expected column types
EXPECTED_TYPES = {
    'name': str,
    'age': int,
    'salary': float
}

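def calculate_quality_metrics(df, expected_types):
    """Return per-column missing rates and type mismatch rates for df."""
    metrics = {}

    # Completeness: fraction of null values in each column
    missing_rates = df.isnull().mean().to_dict()
    metrics['missing_rate'] = {k: round(v, 3) for k, v in missing_rates.items()}

    # Accuracy proxy: fraction of values whose type differs from the expected one.
    # Nulls count as matches here because missing_rate already covers them.
    # Caveat: pandas stores numbers as NumPy scalars, so a value such as
    # numpy.int64 is not an instance of the built-in int and counts as a mismatch.
    mismatch_rate = {}
    for col, expected_type in expected_types.items():
        if col in df.columns:
            mismatches = df[col].apply(lambda x: not isinstance(x, expected_type) if pd.notnull(x) else False)
            mismatch_rate[col] = round(mismatches.mean(), 3)
    metrics['mismatch_rate'] = mismatch_rate

    return metrics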
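
The `isinstance` check above compares each value against Python's built-in types, which can overstate mismatches for numeric columns loaded by pandas. One possible alternative, sketched below, judges numeric columns by whether each value can be parsed as a number. The helper name `type_mismatch_rate` is hypothetical and not part of the original notebook, and treating whole-number floats such as `25.0` as valid integers is an assumption made here for illustration.

```python
import pandas as pd

def type_mismatch_rate(series, expected_type):
    """Fraction of rows holding a non-null value that does not fit expected_type.

    Hypothetical helper: nulls are never counted as mismatches, and the
    denominator is the full column length, as in calculate_quality_metrics().
    """
    present = series.notna()
    if expected_type is str:
        ok = series.map(lambda x: isinstance(x, str))
    else:
        # Values that cannot be parsed as numbers become NaN
        numeric = pd.to_numeric(series, errors="coerce")
        ok = numeric.notna()
        if expected_type is int:
            # Assumption: whole-number floats (e.g. 25.0) are acceptable integers
            ok = ok & (numeric % 1 == 0)
    mismatches = present & ~ok
    return round(float(mismatches.mean()), 3)
```

To try it, call `type_mismatch_rate(df[col], expected_type)` inside the loop over `expected_types` in place of the `apply`/`isinstance` line. An integer column that pandas has promoted to float because of a missing value is then no longer flagged row by row.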


In [21]:
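# Path to the CSV file being monitored
DATA_PATH = "/workspaces/AI_DATA_ANALYSIS_/src/Module 7/Measuring Data Accuracy, Completeness & Consistency/data.csv"

def monitor_data_quality():
    """Load the dataset, compute its quality metrics, and log the result."""
    try:
        df = pd.read_csv(DATA_PATH)
        metrics = calculate_quality_metrics(df, EXPECTED_TYPES)
        logging.info("Quality Metrics: %s", metrics)
        display(Markdown(f"**Logged:** `{metrics}`"))
    except Exception as e:
        # Log the failure instead of raising so the scheduler keeps running
        logging.error("Monitoring failed: %s", e)
        display(Markdown(f"**Error:** {e}"))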


In [19]:
# Run this cell to start scheduled monitoring
from threading import Thread

def run_scheduler():
    schedule.every(30).seconds.do(monitor_data_quality)
    while True:
        schedule.run_pending()
        time.sleep(1)

# Start the scheduler in a thread
Thread(target=run_scheduler, daemon=True).start()
display(Markdown("✅ **Scheduler started. Monitoring every 30 seconds...**"))


✅ **Scheduler started. Monitoring every 30 seconds...**
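
The background loop above runs until the kernel is restarted, and re-running the cell starts a second thread alongside the first. A minimal sketch of a stoppable variant follows. It swaps the `schedule` library for a plain `threading.Event` timer loop, and the name `start_periodic` is hypothetical, not something defined in the notebook.

```python
import threading

def start_periodic(job, interval_seconds):
    """Run job() every interval_seconds in a daemon thread until stopped.

    Hypothetical helper: returns (stop_event, thread) so the caller can
    end the loop with stop_event.set() and then thread.join().
    """
    stop_event = threading.Event()

    def loop():
        # Event.wait() returns False on timeout (time to run the job)
        # and True as soon as stop_event has been set (time to exit).
        while not stop_event.wait(interval_seconds):
            job()

    thread = threading.Thread(target=loop, daemon=True)
    thread.start()
    return stop_event, thread
```

In the notebook this would be called as `stop_event, thread = start_periodic(monitor_data_quality, 30)`, with `stop_event.set()` in a later cell to end monitoring. Unlike `schedule`, this sketch does not catch up on missed runs or support calendar-style intervals.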

In [28]:
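# Print the accumulated log entries.
# Note: this reads data_quality.log, not the logs/quality.log path configured in the setup cell.
with open("/workspaces/AI_DATA_ANALYSIS_/src/Module 7/Measuring Data Accuracy, Completeness & Consistency/data_quality.log", "r") as f:
    print(f.read())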


2025-05-25 05:28:25,235 - INFO - Data Quality Metrics: {'id_completeness': 1.0, 'name_completeness': 0.75, 'age_completeness': 0.75, 'status_completeness': 1.0, 'status_accuracy': 1.0}
2025-05-25 05:29:32,132 - INFO - Data Quality Metrics: {'id_completeness': 1.0, 'name_completeness': 0.75, 'age_completeness': 0.75, 'status_completeness': 1.0, 'status_accuracy': 1.0}
2025-05-25 05:31:26,855 - INFO - Data Quality Metrics: {'id_completeness': 1.0, 'name_completeness': 0.75, 'age_completeness': 0.75, 'status_completeness': 1.0, 'status_accuracy': 1.0}
2025-05-25 05:35:12,674 - INFO - Quality Metrics: {'missing_rate': {'name': 0.143, 'age': 0.143, 'salary': 0.143}, 'mismatch_rate': {'name': 0.0, 'age': 0.857, 'salary': 0.857}}
2025-05-25 05:35:42,711 - INFO - Quality Metrics: {'missing_rate': {'name': 0.143, 'age': 0.143, 'salary': 0.143}, 'mismatch_rate': {'name': 0.0, 'age': 0.857, 'salary': 0.857}}
2025-05-25 05:36:12,746 - INFO - Quality Metrics: {'missing_rate': {'name': 0.143, 'age':
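
To track the metrics over time, the log lines can be read back into structured records. The sketch below assumes the `timestamp - LEVEL - Quality Metrics: {...}` layout seen in the output above (it also accepts ` | ` as the separator, as in the configured `format` string) and assumes the payload is a plain Python dict literal. The function name `parse_quality_log` is hypothetical.

```python
import ast
import re
from datetime import datetime

# Matches e.g. "2025-05-25 05:35:12,674 - INFO - Quality Metrics: {...}"
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) [-|] "
    r"(?P<level>[A-Z]+) [-|] Quality Metrics: (?P<payload>\{.*\})$"
)

def parse_quality_log(lines):
    """Turn 'Quality Metrics' log lines into a list of dicts with a timestamp.

    Lines in any other format, and truncated lines, are skipped.
    """
    records = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        try:
            # literal_eval only accepts Python literals, so it is safe on log text
            metrics = ast.literal_eval(match["payload"])
        except (ValueError, SyntaxError):
            continue
        timestamp = datetime.strptime(match["ts"], "%Y-%m-%d %H:%M:%S,%f")
        records.append({"timestamp": timestamp, **metrics})
    return records
```

Passing an open file object, for example `parse_quality_log(f)`, yields one record per well-formed entry. The records can then be loaded into a DataFrame with `pd.DataFrame(records)` to plot or compare rates across runs.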

In [24]:
monitor_data_quality()


**Logged:** `{'missing_rate': {'name': 0.143, 'age': 0.143, 'salary': 0.143}, 'mismatch_rate': {'name': 0.0, 'age': 0.857, 'salary': 0.857}}`