### Define Data Quality KPIs

**Task 1**: Identify Relevant KPIs

**Objective**: Develop KPIs that align with organizational goals.

**Steps**:
1. Choose a dataset from a domain that interests you (e.g., sales data, healthcare records, or transaction logs).
2. Identify three KPIs that are crucial for assessing data quality in your chosen dataset. Consider accuracy, completeness, and timeliness.
3. Document why each KPI matters for maintaining high-quality data in that context.
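
One way to approach step 3 is to keep each KPI's definition and rationale in a small structured record, so the reasoning travels with the metric. The sketch below assumes a healthcare-records dataset; the `KpiDefinition` class, the `to_markdown` helper, and the wording of each rationale are illustrative assumptions, not a prescribed format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KpiDefinition:
    """Hypothetical record pairing a KPI with what it measures and why it matters."""
    name: str
    measures: str
    rationale: str


# Illustrative entries for a healthcare-records dataset (assumed wording).
KPI_DEFINITIONS = [
    KpiDefinition(
        "Accuracy Rate",
        "Share of diagnosis codes that are well formed",
        "Malformed codes can distort reporting and downstream analysis",
    ),
    KpiDefinition(
        "Completeness Rate",
        "Share of records with no missing fields",
        "Missing physician or diagnosis values make a visit hard to trace",
    ),
    KpiDefinition(
        "Timeliness Rate",
        "Share of records entered soon after the visit",
        "Late entries mean decisions are made on stale data",
    ),
]


def to_markdown(definitions):
    """Render the definitions as a markdown table for a notebook or report."""
    lines = ["| KPI | What it measures | Why it matters |", "| --- | --- | --- |"]
    lines += [f"| {d.name} | {d.measures} | {d.rationale} |" for d in definitions]
    return "\n".join(lines)


print(to_markdown(KPI_DEFINITIONS))
```

Keeping the rationale next to the definition makes it easier to review whether a KPI still serves its purpose when the dataset or the organizational goals change.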

In [None]:
# Write your code from here

import random
from datetime import datetime, timedelta

import pandas as pd

# Fixed seed (arbitrary value) so the random entry delays are reproducible.
random.seed(42)

# Sample healthcare records with deliberate defects: index 3 is missing both
# its diagnosis code and physician ID, and the last code ('INVALID') is malformed.
visit_dates = [datetime(2025, 5, 1) - timedelta(days=i) for i in range(10)]
data = {
    'patient_id': [f'P{1000+i}' for i in range(10)],
    'diagnosis_code': ['A01', 'B20.1', 'C34', None, 'D50.0', 'E11.9', 'F32', 'G40', 'Z99.89', 'INVALID'],
    'visit_date': visit_dates,
    # Each record is entered 1 to 48 hours after the visit.
    'entry_date': [d + timedelta(hours=random.randint(1, 48)) for d in visit_dates],
    'physician_id': [f'DR{i}' if i != 3 else None for i in range(10)]
}

df = pd.DataFrame(data)
df.to_csv('healthcare_records.csv', index=False)
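# Accuracy: share of rows without a malformed diagnosis code. The pattern is a
# simplified, assumed format check; missing codes count under completeness.
codes = df['diagnosis_code']
malformed = codes.notna() & ~codes.fillna('').str.fullmatch(r'[A-Z]\d{2}(\.\d{1,2})?')
# Completeness: share of rows with no missing values.
complete = df.notna().all(axis=1)
# Timeliness: share of records entered within 24 hours of the visit (assumed threshold).
on_time = (df['entry_date'] - df['visit_date']) <= pd.Timedelta(hours=24)

kpis = {
    'Accuracy Rate': round(float(1 - malformed.mean()), 2),
    'Completeness Rate': round(float(complete.mean()), 2),
    'Timeliness Rate': round(float(on_time.mean()), 2)
}
kpis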


**Task 2**: Develop a KPI Dashboard

**Objective**: Visualize your KPIs for better monitoring.

**Steps**:
1. Use a tool like Excel or a BI tool (e.g., Tableau, Power BI) to create a simple dashboard.
2. Input sample data and visualize your chosen KPIs, showing how they would be monitored.
3. Share your dashboard with peers and gather feedback on KPI relevance and clarity.
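
Monitoring usually means comparing each KPI against a target, whichever tool draws the dashboard. The sketch below is a minimal illustration of that check. The latest values are taken from the sample data in this task; the targets, the `kpi_status` and `evaluate` helpers, and the 2-hour timeliness ceiling are assumptions to replace with thresholds agreed with your data owners.

```python
# Latest monthly values, copied from the sample data used in this task.
latest = {"Accuracy_Rate": 99.2, "Completeness_Rate": 99.1, "Timeliness_Metric": 1.8}

# Hypothetical targets as (threshold, higher_is_better).
# The 98% floors mirror the green gauge band; the 2-hour ceiling is assumed.
targets = {
    "Accuracy_Rate": (98.0, True),
    "Completeness_Rate": (98.0, True),
    "Timeliness_Metric": (2.0, False),
}


def kpi_status(value, threshold, higher_is_better):
    """Return 'on target' if the value meets the threshold, else 'off target'."""
    meets = value >= threshold if higher_is_better else value <= threshold
    return "on target" if meets else "off target"


def evaluate(values, kpi_targets):
    """Map each KPI name to its status against the configured target."""
    return {
        name: kpi_status(values[name], threshold, higher_is_better)
        for name, (threshold, higher_is_better) in kpi_targets.items()
    }


report = evaluate(latest, targets)
for name, status in report.items():
    print(f"{name}: {latest[name]} -> {status}")
```

Storing the direction alongside the threshold keeps rate-style KPIs (higher is better) and delay-style KPIs (lower is better) in one table, which is convenient when the same statuses drive dashboard colors.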

In [None]:

import pandas as pd
import plotly.graph_objects as go
from plotly.subplots import make_subplots
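# Twelve months of sample KPI values. 'MS' (month start) is used because the
# 'M' alias is deprecated in pandas 2.2 and later.
data = {
    'Date': pd.date_range(start='2023-01-01', periods=12, freq='MS'),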
    'Accuracy_Rate': [98.2, 98.5, 98.1, 98.9, 99.2, 99.0, 98.7, 99.1, 99.3, 99.0, 98.8, 99.2],
    'Completeness_Rate': [95.0, 96.2, 97.1, 97.8, 98.0, 98.5, 98.2, 98.7, 99.0, 98.8, 98.5, 99.1],
    'Timeliness_Metric': [4.5, 4.2, 3.9, 3.7, 3.5, 3.2, 3.0, 2.8, 2.5, 2.3, 2.0, 1.8]
}
df = pd.DataFrame(data)
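# Layout: two gauges on the top row and a full-width trend chart below.
# Subplot titles are assigned to subplots in order, so the second gauge gets an
# empty title and the third entry labels the trend chart.
fig = make_subplots(
    rows=2, cols=2,
    specs=[[{"type": "indicator"}, {"type": "indicator"}],
           [{"colspan": 2}, None]],
    subplot_titles=("Current Month KPIs", "", "Trend Analysis")
)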
# Gauge for the latest accuracy value. No explicit domain is passed because
# row/col placement sets the trace domain from the subplot grid.
fig.add_trace(go.Indicator(
    mode="gauge+number",
    value=df['Accuracy_Rate'].iloc[-1],
    title={'text': "Accuracy Rate (%)"},
    gauge={'axis': {'range': [90, 100]},
           'steps': [{'range': [90, 95], 'color': "lightgray"},
                     {'range': [95, 98], 'color': "gray"},
                     {'range': [98, 100], 'color': "green"}]}
), row=1, col=1)
# Gauge for the latest completeness value, using the same color bands.
fig.add_trace(go.Indicator(
    mode="gauge+number",
    value=df['Completeness_Rate'].iloc[-1],
    title={'text': "Completeness Rate (%)"},
    gauge={'axis': {'range': [90, 100]},
           'steps': [{'range': [90, 95], 'color': "lightgray"},
                     {'range': [95, 98], 'color': "gray"},
                     {'range': [98, 100], 'color': "green"}]}
), row=1, col=2)
fig.add_trace(go.Scatter(
    x=df['Date'],
    y=df['Timeliness_Metric'],
    name='Timeliness (hours)',
    line=dict(color='royalblue', width=4)
), row=2, col=1)
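# Label the trend chart's y-axis, then set the overall title and figure size.
fig.update_yaxes(title_text="Timeliness (hours)", row=2, col=1)
fig.update_layout(
    title_text="Data Quality KPI Dashboard",
    height=600,
    showlegend=True
)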
fig.show()