### Define Data Quality KPIs

**Task 1**: Identify Relevant KPIs

**Objective**: Develop KPIs that align with organizational goals.

**Steps**:
1. Choose a dataset from a domain that interests you (e.g., sales data, healthcare records, or transaction logs).
2. Identify three KPIs that would be crucial for assessing the data quality in your chosen dataset. Consider accuracy, completeness, and timeliness.
3. Document why each KPI is important for maintaining high-quality data in your given context.

In [1]:
# Task 1: Identify Relevant KPIs for Sales Data

import pandas as pd
import numpy as np

# Sample Sales Dataset
data = {
    'Order_ID': [101, 102, 103, 104, 104],
    'Customer_Name': ['Alice', 'Bob', 'Charlie', None, 'Eve'],
    'Order_Date': ['2023-01-10', '2023-01-15', '2023-01-20', '2023-01-25', None],
    'Amount': [250, 300, 400, None, 200]
}

df = pd.DataFrame(data)
df['Order_Date'] = pd.to_datetime(df['Order_Date'], errors='coerce')

# KPI 1: Completeness (Percentage of non-null values)
def kpi_completeness(df):
    return (df.notnull().sum() / len(df) * 100).round(2)

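# KPI 2: Accuracy (simulated: share of non-null values found in a list of known valid order IDs)
# This is a membership check only, so a repeated ID such as 104 still counts as valid.
def kpi_accuracy(df, column, reference_list):
    valid = df[column].isin(reference_list).sum()
    total = df[column].notnull().sum()
    return round(100 * valid / total, 2) if total > 0 else 0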

# KPI 3: Timeliness (days elapsed since the most recent order date)
def kpi_timeliness(df, date_column):
    most_recent = df[date_column].max()  # max() skips NaT values
    if pd.isnull(most_recent):
        # No parseable dates at all, so an elapsed-days figure would be meaningless
        return "no valid dates found"
    days_diff = (pd.Timestamp.today() - most_recent).days
    return f"{days_diff} days since last entry"

# Display KPI Results
print("=== Data Quality KPIs for Sales Dataset ===")
print("\nKPI 1 - Completeness (%):")
print(kpi_completeness(df))

print(f"\nKPI 2 - Accuracy of 'Order_ID': {kpi_accuracy(df, 'Order_ID', [101, 102, 103, 104])}%")

print(f"\nKPI 3 - Timeliness of 'Order_Date': {kpi_timeliness(df, 'Order_Date')}")


=== Data Quality KPIs for Sales Dataset ===

KPI 1 - Completeness (%):
Order_ID         100.0
Customer_Name     80.0
Order_Date        80.0
Amount            80.0
dtype: float64

KPI 2 - Accuracy of 'Order_ID': 100.0%

KPI 3 - Timeliness of 'Order_Date': 846 days since last entry
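
The 100% accuracy figure above hides the repeated `Order_ID` 104, because the check only tests membership in the reference list. A minimal sketch of a complementary uniqueness check follows. The name `kpi_uniqueness` is hypothetical and not part of the original exercise; it assumes the first occurrence of an ID is the legitimate one.

```python
import pandas as pd

# Same Order_ID values as the sample sales dataset above
orders = pd.DataFrame({'Order_ID': [101, 102, 103, 104, 104]})

def kpi_uniqueness(df, column):
    """Percentage of non-null values that do not repeat an earlier value (hypothetical helper)."""
    values = df[column].dropna()
    if values.empty:
        return float('nan')
    # duplicated() flags every occurrence after the first
    first_occurrences = (~values.duplicated()).sum()
    return round(100 * first_occurrences / len(values), 2)

uniqueness = kpi_uniqueness(orders, 'Order_ID')
print(f"Uniqueness of 'Order_ID': {uniqueness}%")
```

Four of the five IDs are first occurrences, so this reports 80.0%.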


**Task 2**: Develop a KPI Dashboard

**Objective**: Visualize your KPIs for better monitoring.

**Steps**:
1. Use a tool like Excel or a BI tool (e.g., Tableau, Power BI) to create a simple dashboard.
2. Input sample data and visualize your chosen KPIs, showing how they would be monitored.
3. Share your dashboard with peers and gather feedback on KPI relevance and clarity.
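
If you build the dashboard in Excel, Tableau, or Power BI, one possible starting point is to export the KPI values as a tidy table that those tools can import. The sketch below does this for the completeness KPI only; the column names `kpi`, `column`, and `value` and the file name mentioned in the comment are assumptions for illustration.

```python
import pandas as pd

# Same sample sales data used in Task 1
sales = pd.DataFrame({
    'Order_ID': [101, 102, 103, 104, 104],
    'Customer_Name': ['Alice', 'Bob', 'Charlie', None, 'Eve'],
    'Order_Date': ['2023-01-10', '2023-01-15', '2023-01-20', '2023-01-25', None],
    'Amount': [250, 300, 400, None, 200]
})

# Completeness per column, as a percentage of non-null values
completeness = (sales.notnull().sum() / len(sales) * 100).round(2)

# Reshape into one row per (kpi, column) pair, which BI tools handle well
kpi_table = completeness.rename_axis('column').reset_index(name='value')
kpi_table.insert(0, 'kpi', 'completeness')

# to_csv() with no path returns the CSV text; pass a path such as
# 'kpi_summary.csv' (hypothetical name) to write a file instead
csv_text = kpi_table.to_csv(index=False)
print(csv_text)
```

A long, one-row-per-measure layout like this tends to be easier to filter and chart than a wide table with one column per KPI.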

In [2]:
import pandas as pd
import numpy as np
from dash import Dash, html, dcc
import plotly.graph_objs as go

# Sample Data
data = {
    'Order_ID': [101, 102, 103, 104, 104],
    'Customer_Name': ['Alice', 'Bob', 'Charlie', None, 'Eve'],
    'Order_Date': ['2023-01-10', '2023-01-15', '2023-01-20', '2023-01-25', None],
    'Amount': [250, 300, 400, None, 200]
}
df = pd.DataFrame(data)
df['Order_Date'] = pd.to_datetime(df['Order_Date'], errors='coerce')

# KPI Calculations
def kpi_completeness(df):
    return (df.notnull().sum() / len(df) * 100).round(2)

def kpi_accuracy(df, column, reference_list):
    valid = df[column].isin(reference_list).sum()
    total = df[column].notnull().sum()
    return round(100 * valid / total, 2) if total > 0 else 0

def kpi_timeliness(df, date_column):
    today = pd.Timestamp.today()
    most_recent = df[date_column].max()
    days_diff = (today - most_recent).days if pd.notnull(most_recent) else np.nan
    return days_diff

# Calculated KPI Values
completeness = kpi_completeness(df)
accuracy = kpi_accuracy(df, 'Order_ID', [101, 102, 103, 104])
timeliness = kpi_timeliness(df, 'Order_Date')

# Dash App
app = Dash(__name__)

app.layout = html.Div([
    html.H1("Sales Data Quality KPI Dashboard"),
    
    html.Div([
        html.Div([
            html.H3("Completeness (%)"),
            dcc.Graph(
                # dcc.Graph expects a full figure, so wrap the bar trace in go.Figure
                figure=go.Figure(
                    data=[go.Bar(
                        x=completeness.index,
                        y=completeness.values,
                        marker_color='lightblue'
                    )]
                )
            )
        ], style={'width': '32%', 'display': 'inline-block', 'padding': '10px'}),

        html.Div([
            html.H3("Accuracy of Order_ID (%)"),
            html.H4(f"{accuracy}%")
        ], style={'width': '32%', 'display': 'inline-block', 'padding': '10px'}),

        html.Div([
            html.H3("Timeliness (Days since last order)"),
            html.H4(f"{timeliness} days")
        ], style={'width': '32%', 'display': 'inline-block', 'padding': '10px'})
    ])
])

if __name__ == '__main__':
    # app.run_server has been replaced by app.run in current Dash releases
    app.run(debug=True)