## Calculate Data Quality Score
**Introduction**: In this activity, you will calculate data quality scores for sample datasets using three metrics: completeness, accuracy, and consistency. Each task below works through one metric with a small example.

### Task 1: Completeness Score
1. Objective: Determine the percentage of non-missing values in a dataset.
2. Steps:
    - Load a sample dataset using Pandas.
    - Identify the columns with missing values.
    - Calculate the completeness score as the ratio of non-missing values to total values.
    - E.g., a dataset with customer information.

In [1]:
# Write your code from here
import pandas as pd

# Sample data creation (you can replace this by loading your own CSV)
data = {
    'CustomerID': [101, 102, 103, 104, 105],
    'Name': ['Alice', 'Bob', None, 'David', 'Eva'],
    'Email': ['alice@example.com', None, 'charlie@example.com', 'david@example.com', 'eva@example.com'],
    'Age': [25, 30, 22, None, 29]
}

df = pd.DataFrame(data)

print("Dataset:")
print(df)

# 1. Identify columns with missing values
missing_counts = df.isnull().sum()
print("\nMissing values per column:")
print(missing_counts)

# 2. Calculate completeness score
total_values = df.size  # total number of cells in dataframe
non_missing_values = df.count().sum()  # total non-null values across all columns

completeness_score = (non_missing_values / total_values) * 100  # percentage

print(f"\nCompleteness Score: {completeness_score:.2f}%")


Dataset:
   CustomerID   Name                Email   Age
0         101  Alice    alice@example.com  25.0
1         102    Bob                 None  30.0
2         103   None  charlie@example.com  22.0
3         104  David    david@example.com   NaN
4         105    Eva      eva@example.com  29.0

Missing values per column:
CustomerID    0
Name          1
Email         1
Age           1
dtype: int64

Completeness Score: 85.00%
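
The overall score above hides which columns are responsible for the gaps. As a minimal sketch on the same sample data, completeness can also be reported per column and per row. The variable names `column_completeness` and `row_completeness` are illustrative and not part of the original exercise.

```python
import pandas as pd

# Same sample data as in the cell above
df = pd.DataFrame({
    'CustomerID': [101, 102, 103, 104, 105],
    'Name': ['Alice', 'Bob', None, 'David', 'Eva'],
    'Email': ['alice@example.com', None, 'charlie@example.com',
              'david@example.com', 'eva@example.com'],
    'Age': [25, 30, 22, None, 29],
})

# notna() marks non-missing cells; the column-wise mean is the
# fraction of non-missing values in each column
column_completeness = df.notna().mean() * 100

# Share of rows that have no missing value in any column
row_completeness = df.notna().all(axis=1).mean() * 100

print("Completeness per column (%):")
print(column_completeness.round(2))
print(f"\nRows with no missing values: {row_completeness:.2f}%")
```

Here three of the four columns each have one gap, so only two of the five rows are fully populated even though the cell-level score is 85%.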


### Task 2: Accuracy Score

1. Objective: Measure the accuracy of a dataset by comparing it against a reference dataset.
2. Steps:
    - Load the main dataset and a reference dataset.
    - Select key columns for accuracy check.
    - Match values from both datasets and calculate the accuracy percentage.
    - E.g., an existing dataset with sales information.

In [2]:
# Write your code from here

import pandas as pd

# Sample main dataset (e.g., sales data)
main_data = {
    'SaleID': [1, 2, 3, 4, 5],
    'Product': ['A', 'B', 'C', 'D', 'E'],
    'Quantity': [10, 20, 15, 5, 7],
    'Price': [100, 200, 150, 50, 70]
}
df_main = pd.DataFrame(main_data)

# Sample reference dataset (the "correct" data)
reference_data = {
    'SaleID': [1, 2, 3, 4, 5],
    'Product': ['A', 'B', 'C', 'D', 'E'],
    'Quantity': [10, 20, 10, 5, 7],  # Note: Quantity for SaleID 3 is different (15 vs 10)
    'Price': [100, 200, 150, 50, 70]
}
df_ref = pd.DataFrame(reference_data)

print("Main Dataset:")
print(df_main)

print("\nReference Dataset:")
print(df_ref)

# Columns to check for accuracy
key_columns = ['Quantity', 'Price']

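# Merge datasets on a unique identifier (SaleID).
# The default inner join keeps only SaleIDs present in both datasets,
# so records missing from either side are not counted in the score.
df_compare = pd.merge(df_main, df_ref, on='SaleID', suffixes=('_main', '_ref'))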

# Check accuracy per column: compare values from main vs reference dataset
accuracy_results = {}
for col in key_columns:
    correct_matches = (df_compare[f"{col}_main"] == df_compare[f"{col}_ref"]).sum()
    total = len(df_compare)
    accuracy = (correct_matches / total) * 100
    accuracy_results[col] = accuracy

# Calculate overall accuracy as average across key columns
overall_accuracy = sum(accuracy_results.values()) / len(accuracy_results)

# Output
print("\nAccuracy Results per column:")
for col, acc in accuracy_results.items():
    print(f"{col}: {acc:.2f}%")

print(f"\nOverall Accuracy Score: {overall_accuracy:.2f}%")


Main Dataset:
   SaleID Product  Quantity  Price
0       1       A        10    100
1       2       B        20    200
2       3       C        15    150
3       4       D         5     50
4       5       E         7     70

Reference Dataset:
   SaleID Product  Quantity  Price
0       1       A        10    100
1       2       B        20    200
2       3       C        10    150
3       4       D         5     50
4       5       E         7     70

Accuracy Results per column:
Quantity: 80.00%
Price: 100.00%

Overall Accuracy Score: 90.00%
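
The score above assumes every `SaleID` appears in both datasets. When the main dataset is missing records, an inner join silently drops them and the accuracy looks better than it is. The following is a sketch of one way to handle that, assuming a missing record should count as a mismatch. The smaller sample frames and the names `accuracy_inner`, `accuracy_vs_reference`, and `missing_from_main` are made up for illustration.

```python
import pandas as pd

# Hypothetical data: the main dataset lacks SaleID 5
df_main = pd.DataFrame({'SaleID': [1, 2, 3, 4], 'Quantity': [10, 20, 15, 5]})
df_ref = pd.DataFrame({'SaleID': [1, 2, 3, 4, 5], 'Quantity': [10, 20, 10, 5, 7]})

# Inner join: only the four shared SaleIDs are compared
inner = pd.merge(df_main, df_ref, on='SaleID', suffixes=('_main', '_ref'))
accuracy_inner = (inner['Quantity_main'] == inner['Quantity_ref']).mean() * 100

# Left join from the reference: every reference row is kept, and
# indicator=True adds a '_merge' column showing where each row came from
compare = pd.merge(df_ref, df_main, on='SaleID', how='left',
                   suffixes=('_ref', '_main'), indicator=True)

# NaN never equals a number, so rows missing from main count as mismatches
matches = compare['Quantity_main'] == compare['Quantity_ref']
accuracy_vs_reference = matches.mean() * 100
missing_from_main = (compare['_merge'] == 'left_only').sum()

print(f"Accuracy over shared rows: {accuracy_inner:.2f}%")
print(f"Accuracy over all reference rows: {accuracy_vs_reference:.2f}%")
print(f"Reference rows missing from main: {missing_from_main}")
```

Which denominator is appropriate depends on whether missing records are treated as an accuracy problem or reported separately as a completeness problem.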


### Task 3: Consistency Score

1. Objective: Evaluate the consistency within a dataset for specific columns.
2. Steps:
    - Choose a column expected to have consistent values.
    - Use statistical or rule-based checks to identify inconsistencies.
    - Calculate the consistency score as the ratio of consistent entries to total entries.
    - E.g., validating phone number formats in a contact list.

In [3]:
# Write your code from here
import pandas as pd
import re

# Sample dataset with phone numbers
data = {
    'ContactID': [1, 2, 3, 4, 5],
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Phone': ['+1-202-555-0156', '2025550157', '+1-202-555-0158', '12345', '+1-202-555-0159']
}

df = pd.DataFrame(data)

print("Dataset:")
print(df)

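# Define a phone number consistency check function (e.g., US number format with country code)
def is_phone_consistent(phone):
    pattern = r'\+1-\d{3}-\d{3}-\d{4}'  # Example: +1-202-555-0156
    # Missing or non-string values (None, NaN) count as inconsistent instead of raising a TypeError
    return isinstance(phone, str) and re.fullmatch(pattern, phone) is not None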

# Apply consistency check
df['is_consistent'] = df['Phone'].apply(is_phone_consistent)

# Calculate consistency score
consistent_count = df['is_consistent'].sum()
total_count = len(df)
consistency_score = (consistent_count / total_count) * 100

print("\nPhone Number Consistency Check:")
print(df[['Phone', 'is_consistent']])

print(f"\nConsistency Score: {consistency_score:.2f}%")



Dataset:
   ContactID     Name            Phone
0          1    Alice  +1-202-555-0156
1          2      Bob       2025550157
2          3  Charlie  +1-202-555-0158
3          4    David            12345
4          5      Eva  +1-202-555-0159

Phone Number Consistency Check:
             Phone  is_consistent
0  +1-202-555-0156           True
1       2025550157          False
2  +1-202-555-0158           True
3            12345          False
4  +1-202-555-0159           True

Consistency Score: 60.00%
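
The same rule-based check can be written without a Python-level function by using pandas string methods. This is a sketch that assumes pandas 1.1 or later (where `Series.str.fullmatch` is available) and that a missing phone number should count as inconsistent. The sample series with a `None` entry is made up for illustration.

```python
import pandas as pd

# Hypothetical contact list that includes a missing phone number
phones = pd.Series(['+1-202-555-0156', '2025550157', None,
                    '12345', '+1-202-555-0159'])

# Expected format: +1-XXX-XXX-XXXX
pattern = r'\+1-\d{3}-\d{3}-\d{4}'

# fullmatch requires the entire string to match the pattern;
# na=False makes missing values evaluate to False
is_consistent = phones.str.fullmatch(pattern, na=False)

# The mean of a boolean series is the fraction of True values
consistency_score = is_consistent.mean() * 100

print(is_consistent)
print(f"\nConsistency Score: {consistency_score:.2f}%")
```

Only two of the five entries follow the expected format here, since the missing value is counted against the score.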
