## Defining Data Quality SLAs
### Data Completeness
**Description**: Set an SLA requiring that at least 95% of the data fields in your dataset are filled (non-null). Practice by checking a dataset of your choice and calculating its completeness.

In [1]:
# write your code from here
import pandas as pd

# Sample dataset defined in code
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', None],
    'Age': [25, None, 30, 22, 28],
    'Email': ['alice@example.com', 'bob@example.com', None, 'david@example.com', 'eve@example.com'],
    'City': ['New York', 'Los Angeles', 'Chicago', None, 'Houston']
}

# Create DataFrame
df = pd.DataFrame(data)

# Print dataset
print("Sample Dataset:")
print(df)

# Calculate total cells and non-null cells
total_cells = df.size
non_null_cells = df.count().sum()

# Calculate completeness
completeness = (non_null_cells / total_cells) * 100

# Print completeness
print(f"\nData Completeness: {completeness:.2f}%")

# SLA check
if completeness >= 95:
    print("✅ SLA met: Data completeness is acceptable.")
else:
    print("❌ SLA not met: Investigate missing data.")

# Optional: See missing values per column
print("\nMissing Values per Column:")
print(df.isnull().sum())


Sample Dataset:
      Name   Age              Email         City
0    Alice  25.0  alice@example.com     New York
1      Bob   NaN    bob@example.com  Los Angeles
2  Charlie  30.0               None      Chicago
3    David  22.0  david@example.com         None
4     None  28.0    eve@example.com      Houston

Data Completeness: 80.00%
❌ SLA not met: Investigate missing data.

Missing Values per Column:
Name     1
Age      1
Email    1
City     1
dtype: int64
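
The overall figure can hide which columns drive a breach. As a minimal sketch, the same 95% threshold could also be applied per column. The helper name `completeness_report` and its `threshold` parameter are hypothetical and introduced here only for illustration.

```python
import pandas as pd


def completeness_report(df: pd.DataFrame, threshold: float = 95.0) -> pd.DataFrame:
    """Return per-column completeness (%) and whether each column meets the threshold.

    Hypothetical helper for illustration; not part of pandas.
    """
    # notna().mean() gives the fraction of non-null values in each column
    pct = df.notna().mean() * 100
    return pd.DataFrame({
        "completeness_pct": pct.round(2),
        "sla_met": pct >= threshold,
    })


# Same sample data as in the cell above
sample = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David", None],
    "Age": [25, None, 30, 22, 28],
    "Email": ["alice@example.com", "bob@example.com", None,
              "david@example.com", "eve@example.com"],
    "City": ["New York", "Los Angeles", "Chicago", None, "Houston"],
})

report = completeness_report(sample)
print(report)
```

With this sample every column has one missing value out of five, so each column sits at 80% and none meets the 95% target on its own.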


### Data Timeliness
**Description**: Establish an SLA specifying that data must be integrated and processed within 24 hours of acquisition. Monitor the data pipeline for timeliness.

In [2]:
# write your code from here
import pandas as pd
from datetime import datetime, timedelta

# Step 1: Sample dataset
data = {
    'record_id': [1, 2, 3, 4, 5],
    'acquisition_time': [
        datetime(2025, 5, 27, 10, 0),
        datetime(2025, 5, 27, 12, 0),
        datetime(2025, 5, 27, 9, 30),
        datetime(2025, 5, 26, 8, 0),
        datetime(2025, 5, 27, 14, 0)
    ],
    'processing_time': [
        datetime(2025, 5, 27, 18, 0),   # 8 hrs
        datetime(2025, 5, 28, 11, 0),   # 23 hrs
        datetime(2025, 5, 27, 20, 0),   # 10.5 hrs
        datetime(2025, 5, 27, 12, 0),   # 28 hrs ❌
        datetime(2025, 5, 28, 12, 0)    # 22 hrs
    ]
}

# Step 2: Create DataFrame
df = pd.DataFrame(data)

# Step 3: Calculate time delta
df['time_to_process'] = df['processing_time'] - df['acquisition_time']

# Step 4: Evaluate SLA (24 hours)
sla_limit = pd.Timedelta(hours=24)
df['sla_met'] = df['time_to_process'] <= sla_limit

# Step 5: Calculate SLA compliance rate
sla_compliance_rate = df['sla_met'].mean() * 100

# Step 6: Display results
print("📊 Timeliness SLA Check:")
print(df[['record_id', 'acquisition_time', 'processing_time', 'time_to_process', 'sla_met']])
print(f"\n📈 Data Timeliness Compliance: {sla_compliance_rate:.2f}%")

# The 95% compliance target is an assumption; the SLA description only fixes the 24-hour window
if sla_compliance_rate >= 95:
    print("✅ SLA met: Data is timely.")
else:
    print("❌ SLA not met: Investigate processing delays.")


📊 Timeliness SLA Check:
   record_id    acquisition_time     processing_time time_to_process  sla_met
0          1 2025-05-27 10:00:00 2025-05-27 18:00:00 0 days 08:00:00     True
1          2 2025-05-27 12:00:00 2025-05-28 11:00:00 0 days 23:00:00     True
2          3 2025-05-27 09:30:00 2025-05-27 20:00:00 0 days 10:30:00     True
3          4 2025-05-26 08:00:00 2025-05-27 12:00:00 1 days 04:00:00    False
4          5 2025-05-27 14:00:00 2025-05-28 12:00:00 0 days 22:00:00     True

📈 Data Timeliness Compliance: 80.00%
❌ SLA not met: Investigate processing delays.
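
When monitoring a live pipeline, some records may not have been processed yet, so `processing_time` is missing. One possible way to handle this, sketched below, is to measure the elapsed time of pending records up to a reference timestamp. The function name `timeliness_flags`, the `now` argument, and the `pending` and `within_sla` columns are assumptions made for this example.

```python
import pandas as pd


def timeliness_flags(df: pd.DataFrame, now: pd.Timestamp,
                     sla: pd.Timedelta = pd.Timedelta(hours=24)) -> pd.DataFrame:
    """Flag records against the SLA, treating unprocessed rows as still running.

    Hypothetical helper for illustration.
    """
    out = df.copy()
    # Rows without a processing_time are pending
    out["pending"] = out["processing_time"].isna()
    # For pending rows, measure elapsed time up to `now` instead
    effective_end = out["processing_time"].fillna(now)
    out["elapsed"] = effective_end - out["acquisition_time"]
    # A pending row inside the window has not breached the SLA yet
    out["within_sla"] = out["elapsed"] <= sla
    return out


# Illustrative records: one processed, two still pending
records = pd.DataFrame({
    "record_id": [1, 2, 3],
    "acquisition_time": pd.to_datetime(
        ["2025-05-27 10:00", "2025-05-26 08:00", "2025-05-27 14:00"]),
    "processing_time": pd.to_datetime(["2025-05-27 18:00", None, None]),
})

flags = timeliness_flags(records, now=pd.Timestamp("2025-05-28 12:00"))
print(flags[["record_id", "pending", "elapsed", "within_sla"]])
```

Passing `now` explicitly rather than reading the system clock keeps the check reproducible, which helps when the same logic is reused in tests.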


### Data Consistency
**Description**: Define an SLA for maintaining consistency across related datasets. Implement a check to ensure that at least 99% of data entries are consistent.

In [1]:
# write your code from here
import pandas as pd

# Sample Dataset 1: Customers
customers = pd.DataFrame({
    'customer_id': [101, 102, 103, 104],
    'name': ['Alice', 'Bob', 'Charlie', 'David']
})

# Sample Dataset 2: Orders
orders = pd.DataFrame({
    'order_id': [1001, 1002, 1003, 1004, 1005, 1006],
    'customer_id': [101, 102, 105, 103, 104, 106],  # Note: 105 and 106 don't exist in customers
    'amount': [250, 300, 150, 400, 500, 350]
})

# Step 1: Check consistency (join condition: customer_id exists)
orders['customer_exists'] = orders['customer_id'].isin(customers['customer_id'])

# Step 2: Calculate consistency rate
consistency_rate = orders['customer_exists'].mean() * 100

# Step 3: Display results
print("📊 Orders Dataset with Consistency Check:")
print(orders[['order_id', 'customer_id', 'customer_exists']])

print(f"\n📈 Data Consistency Rate: {consistency_rate:.2f}%")

# Step 4: SLA Evaluation
if consistency_rate >= 99:
    print("✅ SLA met: Data is consistent across datasets.")
else:
    print("❌ SLA not met: Investigate data mismatches.")


📊 Orders Dataset with Consistency Check:
   order_id  customer_id  customer_exists
0      1001          101             True
1      1002          102             True
2      1003          105            False
3      1004          103             True
4      1005          104             True
5      1006          106            False

📈 Data Consistency Rate: 66.67%
❌ SLA not met: Investigate data mismatches.
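
To investigate the mismatches, it helps to list the offending rows rather than only the rate. A minimal sketch, assuming the same two tables as above, uses a left merge with `indicator=True` to isolate orders whose `customer_id` has no match. The variable names `merged` and `orphans` are illustrative.

```python
import pandas as pd

# Same sample tables as in the cell above
customers = pd.DataFrame({
    "customer_id": [101, 102, 103, 104],
    "name": ["Alice", "Bob", "Charlie", "David"],
})
orders = pd.DataFrame({
    "order_id": [1001, 1002, 1003, 1004, 1005, 1006],
    "customer_id": [101, 102, 105, 103, 104, 106],
    "amount": [250, 300, 150, 400, 500, 350],
})

# Left merge keeps every order; the _merge column records where each row matched
merged = orders.merge(
    customers[["customer_id"]].drop_duplicates(),
    on="customer_id",
    how="left",
    indicator=True,
)

# "left_only" marks orders that reference a customer missing from the customers table
orphans = merged.loc[merged["_merge"] == "left_only", ["order_id", "customer_id"]]
print(orphans)
```

Here the orphaned rows are orders 1003 and 1006, the same two entries flagged `False` in the check above.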
