## Calculate Data Quality Score
**Introduction**: In this activity, you will calculate data quality scores for sample datasets using three metrics: completeness, accuracy, and consistency. Each task works through one metric with a small example.

### Task 1: Completeness Score
1. Objective: Determine the percentage of non-missing values in a dataset.
2. Steps:
    - Load a sample dataset using Pandas.
    - Identify the columns with missing values.
    - Calculate the completeness score as the ratio of non-missing values to total values.
    - E.g., a dataset with customer information.

```python
import pandas as pd

# Sample customer dataset
data = {
    'CustomerID': [1, 2, 3, 4, 5],
    'Name': ['Alice', 'Bob', 'Charlie', None, 'Eve'],
    'Email': ['alice@example.com', None, 'charlie@example.com', 'dave@example.com', None],
    'Phone': ['123-456-7890', '987-654-3210', None, None, '456-789-0123']
}

# Load data into a DataFrame
df = pd.DataFrame(data)

# Calculate total and non-missing values
total_values = df.size
non_missing_values = df.count().sum()

# Calculate overall completeness score
completeness_score = (non_missing_values / total_values) * 100

# Calculate per-column completeness (optional)
column_completeness = df.notnull().mean() * 100

# Display results
print("📄 Sample Data:\n", df)
print("\n🔍 Missing Values Per Column:\n", df.isnull().sum())
print(f"\n✅ Overall Completeness Score: {completeness_score:.2f}%")
print("\n📊 Completeness Per Column (%):\n", column_completeness)
```


Output:

```text
📄 Sample Data:
    CustomerID     Name                Email         Phone
0           1    Alice    alice@example.com  123-456-7890
1           2      Bob                 None  987-654-3210
2           3  Charlie  charlie@example.com          None
3           4     None     dave@example.com          None
4           5      Eve                 None  456-789-0123

🔍 Missing Values Per Column:
 CustomerID    0
Name          1
Email         2
Phone         2
dtype: int64

✅ Overall Completeness Score: 75.00%

📊 Completeness Per Column (%):
 CustomerID    100.0
Name           80.0
Email          60.0
Phone          60.0
dtype: float64
```
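
The completeness calculation above can be wrapped in a reusable helper. The following is a minimal sketch and not part of the original activity: the function name `completeness_report` and the `blank_is_missing` option are hypothetical, and treating whitespace-only strings as missing is an assumption that may not suit every dataset.

```python
import pandas as pd


def completeness_report(df, blank_is_missing=False):
    """Return (overall %, per-column %) of non-missing cells.

    Hypothetical helper. `blank_is_missing` is an assumed option that also
    counts empty or whitespace-only strings as missing.
    """
    if df.size == 0:
        raise ValueError("Cannot score an empty DataFrame.")

    missing = df.isna()
    if blank_is_missing:
        # Flag string cells that contain nothing but whitespace.
        blank = df.apply(
            lambda col: col.map(lambda v: isinstance(v, str) and v.strip() == "")
        )
        missing = missing | blank

    present = ~missing
    per_column = present.mean() * 100          # share of present cells per column
    overall = present.to_numpy().mean() * 100  # share of present cells overall
    return float(overall), per_column


# Same sample customer data as in the task above
customers = pd.DataFrame({
    'CustomerID': [1, 2, 3, 4, 5],
    'Name': ['Alice', 'Bob', 'Charlie', None, 'Eve'],
    'Email': ['alice@example.com', None, 'charlie@example.com', 'dave@example.com', None],
    'Phone': ['123-456-7890', '987-654-3210', None, None, '456-789-0123']
})

overall, per_column = completeness_report(customers)
print(f"Overall completeness: {overall:.2f}%")  # 75.00%
print(per_column)
```

With the default settings this reproduces the 75.00% overall score: 15 of the 20 cells are present. Passing `blank_is_missing=True` only changes the result when the data contains blank strings, which this sample does not.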


### Task 2: Accuracy Score

1. Objective: Measure the accuracy of a dataset by comparing it against a reference dataset.
2. Steps:
    - Load the main dataset and a reference dataset.
    - Select key columns for accuracy check.
    - Match values from both datasets and calculate the accuracy percentage.
    - E.g., an existing dataset with sales information.

```python
import pandas as pd

# Main dataset (e.g., collected sales records)
main_data = {
    'TransactionID': [101, 102, 103, 104, 105],
    'Product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Headset'],
    'Amount': [1200, 25, 45, 300, 80]
}

# Reference dataset (e.g., trusted system of record)
reference_data = {
    'TransactionID': [101, 102, 103, 104, 105],
    'Product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Speaker'],  # Note: 'Headset' vs 'Speaker'
    'Amount': [1200, 25, 45, 310, 80]  # Note: '300' vs '310'
}

# Load both into DataFrames
df_main = pd.DataFrame(main_data)
df_ref = pd.DataFrame(reference_data)

# Merge datasets on TransactionID
df_merged = pd.merge(df_main, df_ref, on='TransactionID', suffixes=('_main', '_ref'))

# Columns to check for accuracy
columns_to_check = ['Product', 'Amount']

# Count matches across all checked columns
total_checks = len(df_merged) * len(columns_to_check)
match_count = 0

for col in columns_to_check:
    match_count += (df_merged[f"{col}_main"] == df_merged[f"{col}_ref"]).sum()

# Calculate accuracy score
accuracy_score = (match_count / total_checks) * 100

# Show results
print("📄 Merged Dataset for Accuracy Check:\n", df_merged)
print(f"\n✅ Accuracy Score: {accuracy_score:.2f}%")
```


Output:

```text
📄 Merged Dataset for Accuracy Check:
    TransactionID Product_main  Amount_main Product_ref  Amount_ref
0            101       Laptop         1200      Laptop        1200
1            102        Mouse           25       Mouse          25
2            103     Keyboard           45    Keyboard          45
3            104      Monitor          300     Monitor         310
4            105      Headset           80     Speaker          80

✅ Accuracy Score: 80.00%
```
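
A per-column breakdown can show where the mismatches come from. The sketch below is one possible way to generalize the comparison above. The function name `accuracy_report` and its parameters are hypothetical, and it assumes each key appears at most once in each dataset.

```python
import pandas as pd


def accuracy_report(main, reference, key, columns):
    """Return (overall %, per-column %) of values that match the reference.

    Hypothetical helper. Rows whose key is missing from either dataset are
    dropped by the inner merge and therefore not scored.
    """
    merged = main.merge(
        reference,
        on=key,
        suffixes=("_main", "_ref"),
        validate="one_to_one",  # raises if a key is duplicated on either side
    )
    if merged.empty:
        raise ValueError("No overlapping keys to compare.")

    # Note: two missing values in the same cell compare as unequal here.
    per_column = pd.Series(
        {
            col: (merged[f"{col}_main"] == merged[f"{col}_ref"]).mean() * 100
            for col in columns
        }
    )
    # Every column is scored on the same rows, so the mean of the per-column
    # scores equals the overall match rate.
    overall = per_column.mean()
    return float(overall), per_column


# Same sample sales data as in the task above
sales_main = pd.DataFrame({
    'TransactionID': [101, 102, 103, 104, 105],
    'Product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Headset'],
    'Amount': [1200, 25, 45, 300, 80]
})
sales_ref = pd.DataFrame({
    'TransactionID': [101, 102, 103, 104, 105],
    'Product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Speaker'],
    'Amount': [1200, 25, 45, 310, 80]
})

overall_accuracy, column_accuracy = accuracy_report(
    sales_main, sales_ref, key='TransactionID', columns=['Product', 'Amount']
)
print(f"Overall accuracy: {overall_accuracy:.2f}%")  # 80.00%
print(column_accuracy)
```

For the sample data, `Product` and `Amount` each have one mismatch in five rows, so both columns score 80% and the overall score is 80.00%.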


### Task 3: Consistency Score

1. Objective: Evaluate the consistency within a dataset for specific columns.
2. Steps:
    - Choose a column expected to have consistent values.
    - Use statistical or rule-based checks to identify inconsistencies.
    - Calculate the consistency score as the ratio of consistent entries to total entries.
    - E.g., validating phone number formats in a contact list.

```python
import pandas as pd
import re

# Sample dataset with phone numbers
data = {
    'CustomerID': [1, 2, 3, 4, 5],
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Phone': ['123-456-7890', '(987) 654-3210', '4567890123', '123-4567', '987-654-3210']
}

# Load into DataFrame
df = pd.DataFrame(data)

# Define regex pattern for valid phone format: e.g., XXX-XXX-XXXX or (XXX) XXX-XXXX
pattern = re.compile(r'^(\(\d{3}\)\s?\d{3}-\d{4}|\d{3}-\d{3}-\d{4})$')

# Check consistency using the pattern
df['Phone_Consistent'] = df['Phone'].apply(lambda x: bool(pattern.match(str(x))))

# Calculate consistency score
total_entries = df.shape[0]
consistent_entries = df['Phone_Consistent'].sum()
consistency_score = (consistent_entries / total_entries) * 100

# Show results
print("📄 Dataset with Consistency Check:\n", df)
print(f"\n✅ Consistency Score: {consistency_score:.2f}%")
```



Output:

```text
📄 Dataset with Consistency Check:
    CustomerID     Name           Phone  Phone_Consistent
0           1    Alice    123-456-7890              True
1           2      Bob  (987) 654-3210              True
2           3  Charlie      4567890123             False
3           4    David        123-4567             False
4           5      Eve    987-654-3210              True

✅ Consistency Score: 60.00%
```
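
The same rule-based check can also be written with pandas' vectorized string methods instead of `apply`. This is a minimal sketch under stated assumptions: the function name `consistency_score` and the constant `PHONE_PATTERN` are hypothetical, and missing values are counted as inconsistent, which is a design choice rather than something the task prescribes.

```python
import pandas as pd

# Accepts XXX-XXX-XXXX or (XXX) XXX-XXXX. The non-capturing group keeps the
# alternation intact when the whole string must match.
PHONE_PATTERN = r"(?:\(\d{3}\)\s?\d{3}-\d{4}|\d{3}-\d{3}-\d{4})"


def consistency_score(values, pattern):
    """Return the percentage of entries that fully match `pattern`.

    Hypothetical helper. Missing values are treated as inconsistent.
    """
    if len(values) == 0:
        raise ValueError("Cannot score an empty column.")

    # Convert to pandas' string dtype, then test each entry against the
    # whole pattern; na=False marks missing entries as non-matching.
    matches = values.astype("string").str.fullmatch(pattern, na=False)
    return float(matches.mean() * 100)


# Same phone numbers as in the task above
phones = pd.Series(
    ['123-456-7890', '(987) 654-3210', '4567890123', '123-4567', '987-654-3210']
)

score = consistency_score(phones, PHONE_PATTERN)
print(f"Consistency score: {score:.2f}%")  # 60.00%
```

Three of the five sample numbers follow one of the two accepted formats, so the score matches the 60.00% computed above. Whether missing phone numbers should lower the consistency score or be left to the completeness score depends on how the metrics are meant to be combined.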
