## Architecture to Monitor Data Quality Over Time

**Description**: Design a monitoring system in Python that checks and logs data quality metrics (accuracy, completeness) for a dataset over time.

**Steps to follow:**
1. Implement a Scheduled Script:
    - Use the `schedule` library to run the check periodically.
2. Script to Calculate Metrics:
    - For simplicity, use a function `calculate_quality_metrics()` that calculates and logs metrics such as the missing-data rate and the mismatch rate.
3. Store Logs:
    - Use Python's `logging` module to save these metrics over time.

In [None]:
import pandas as pd
import schedule
import time
import logging

# Step 1: Set up the logging configuration
logging.basicConfig(filename='data_quality_log.txt', level=logging.INFO,
                    format='%(asctime)s - %(message)s')

# Sample dataset for demonstration
data = {
    'transaction_id': [1, 2, 3, None, 5],
    'amount': [100.0, 200.5, None, 400.0, 500.5],
    'date': ['2023-01-01', '2023-01-02', None, '2023-01-04', '2023-01-05']
}

# Creating a DataFrame for the example dataset
df = pd.DataFrame(data)

# Step 2: Function to calculate quality metrics
def calculate_quality_metrics(df):
    # Calculate missing data rate for each column
    missing_data_rate = df.isnull().mean() * 100

    # Calculate the mismatch rate for 'amount' against a trusted source (dummy values here).
    # Note: a missing amount (NaN) never equals its expected value, so it is counted as a mismatch.
    expected_amount = [100.0, 200.5, 300.0, 400.0, 500.0]  # Example of expected values
    mismatch_rate = (df['amount'] != pd.Series(expected_amount)).sum() / len(df) * 100

    # Log the calculated metrics (as a dict so each log record stays on one line)
    logging.info(f"Missing Data Rate (%): {missing_data_rate.to_dict()}")
    logging.info(f"Mismatch Rate: {mismatch_rate}%")

    print(f"Missing Data Rate:\n{missing_data_rate}")
    print(f"Mismatch Rate: {mismatch_rate}%")

# Step 3: Function to run the scheduled task
def run_monitoring_task():
    print("Running data quality check...")
    calculate_quality_metrics(df)

# Step 4: Schedule the task to run every 5 minutes
schedule.every(5).minutes.do(run_monitoring_task)

# Step 5: Keep the script running to monitor data quality periodically
try:
    while True:
        schedule.run_pending()
        time.sleep(1)  # Wait for 1 second before checking again
except KeyboardInterrupt:
    print("Monitoring script has been stopped by the user.")


Running data quality check...
Missing Data Rate:
transaction_id    20.0
amount            20.0
date              20.0
dtype: float64
Mismatch Rate: 40.0%
Running data quality check...
Missing Data Rate:
transaction_id    20.0
amount            20.0
date              20.0
dtype: float64
Mismatch Rate: 40.0%
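
The 40% mismatch rate above includes the row whose `amount` is missing, so that row is reported under both completeness and accuracy. If the two metrics should be kept separate, one option is to compare only the rows that actually have a value. The following is a minimal sketch under that assumption; the function name `mismatch_rate_excluding_missing` is hypothetical, and it assumes both inputs have the same length and order.

```python
import pandas as pd

def mismatch_rate_excluding_missing(actual, expected):
    """Percentage of non-missing values in `actual` that differ from `expected`.

    Assumes `actual` and `expected` have the same length and row order.
    """
    actual = pd.Series(actual, dtype="float64").reset_index(drop=True)
    expected = pd.Series(expected, dtype="float64").reset_index(drop=True)

    # Only rows with an observed value take part in the accuracy check
    present = actual.notna()
    if present.sum() == 0:
        return float("nan")  # nothing to compare

    mismatches = (actual[present] != expected[present]).sum()
    return float(mismatches / present.sum() * 100)

# Same sample values as in the cell above
amounts = [100.0, 200.5, None, 400.0, 500.5]
expected_amounts = [100.0, 200.5, 300.0, 400.0, 500.0]
print(mismatch_rate_excluding_missing(amounts, expected_amounts))  # 25.0
```

With the sample data, four rows have a value and one of them (500.5 vs. 500.0) differs, so the rate drops from 40% to 25%, while the missing row is still captured by the missing-data rate.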


In [None]:
import pandas as pd

# Sample data to write to CSV
data = {
    'transaction_id': [1, 2, 3, 4, 5],
    'amount': [100.0, 200.5, None, 400.0, 500.5],
    'date': ['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04', '2023-01-05']
}

# Create a DataFrame
df = pd.DataFrame(data)

# Save to CSV
df.to_csv('transaction_data.csv', index=False)
print("CSV file 'transaction_data.csv' created successfully.")


CSV file 'transaction_data.csv' created successfully.
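
A file like this could serve as the input that each monitoring run reloads, so the metrics reflect the current state of the data rather than a DataFrame built once in memory. The sketch below illustrates that idea; the helper name `missing_rate_from_csv` is hypothetical and is not part of the cells above.

```python
import pandas as pd

def missing_rate_from_csv(path):
    """Load a CSV file and return the missing-data rate (%) per column as a dict."""
    df = pd.read_csv(path)  # empty fields are read as NaN
    return (df.isnull().mean() * 100).to_dict()
```

Calling `missing_rate_from_csv('transaction_data.csv')` inside the scheduled task would re-read the file on every run. For the sample file written above, only the `amount` column has a missing value, so its rate would be 20% and the other columns 0%.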
