## Calculate Data Quality Score
**Introduction**: In this activity, you will calculate data quality scores for small sample datasets using three metrics: completeness, accuracy, and consistency. Each score is a fraction between 0 and 1, where 1 means no issues were found for that metric.

### Task 1: Completeness Score
1. Objective: Determine the percentage of non-missing values in a dataset.
2. Steps:
    - Load a sample dataset using Pandas.
    - Identify the columns with missing values.
    - Calculate the completeness score as the ratio of non-missing values to total values.
    - E.g., a dataset with customer information.

In [1]:
# Write your code from here
import pandas as pd
from io import StringIO

# Sample customer dataset with some missing values
csv_data = """CustomerID,Name,Email,Phone
1,Alice,alice@example.com,1234567890
2,Bob,,9876543210
3,Charlie,charlie@example.com,
4,David,david@example.com,4567891230
5,Eve,,5678912345
"""

# Load dataset into DataFrame
df = pd.read_csv(StringIO(csv_data))

# Display the dataset
print("Dataset:\n", df)

# Calculate completeness for each column
completeness_per_column = df.notnull().mean()

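# Overall completeness score: mean of the column scores, which equals the
# ratio of non-missing cells to total cells because every column has the same row count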
overall_completeness_score = completeness_per_column.mean()

print("\nCompleteness per column:")
print(completeness_per_column)

print(f"\nOverall Completeness Score: {overall_completeness_score:.2f}")

Dataset:
    CustomerID     Name                Email         Phone
0           1    Alice    alice@example.com  1.234568e+09
1           2      Bob                  NaN  9.876543e+09
2           3  Charlie  charlie@example.com           NaN
3           4    David    david@example.com  4.567891e+09
4           5      Eve                  NaN  5.678912e+09

Completeness per column:
CustomerID    1.0
Name          1.0
Email         0.6
Phone         0.8
dtype: float64

Overall Completeness Score: 0.85
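
The cell above reports a score per column but does not explicitly list the columns that contain missing values (the second step), and `Phone` is parsed as a float because of its missing entry. The following is a minimal sketch on the same sample data. Reading `Phone` as text through the `dtype` argument is an assumption added here, not part of the original exercise.

```python
import pandas as pd
from io import StringIO

csv_data = """CustomerID,Name,Email,Phone
1,Alice,alice@example.com,1234567890
2,Bob,,9876543210
3,Charlie,charlie@example.com,
4,David,david@example.com,4567891230
5,Eve,,5678912345
"""

# Assumption: read Phone as text so phone numbers are not converted to floats
df = pd.read_csv(StringIO(csv_data), dtype={"Phone": str})

# Count missing values per column and keep only the columns that have any
missing_counts = df.isna().sum()
columns_with_missing = missing_counts[missing_counts > 0]

# Cell-level completeness: non-missing cells divided by all cells
non_missing_cells = int(df.notna().sum().sum())
total_cells = df.size
cell_completeness = non_missing_cells / total_cells

print("Columns with missing values:")
print(columns_with_missing)
print(f"\nCell-level completeness: {non_missing_cells}/{total_cells} = {cell_completeness:.2f}")
```

Because every column has the same number of rows, the cell-level ratio (17 of 20 cells) agrees with the mean of the per-column scores.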


### Task 2: Accuracy Score

1. Objective: Measure the accuracy of a dataset by comparing it against a reference dataset.
2. Steps:
    - Load the main dataset and a reference dataset.
    - Select key columns for accuracy check.
    - Match values from both datasets and calculate the accuracy percentage.
    - E.g., an existing dataset with sales information.

In [2]:
# Write your code from here
import pandas as pd
from io import StringIO

# Main dataset (possibly containing errors)
main_data = """OrderID,Customer,Amount
101,John,250
102,Alice,300
103,Bob,275
104,David,400
105,Eve,350
"""

# Reference dataset (considered correct)
reference_data = """OrderID,Customer,Amount
101,John,250
102,Alice,300
103,Bob,280
104,David,400
105,Eve,340
"""

# Load both datasets
main_df = pd.read_csv(StringIO(main_data))
ref_df = pd.read_csv(StringIO(reference_data))

# Merge on OrderID to compare records (default inner join: only orders
# present in both datasets are compared)
merged_df = pd.merge(main_df, ref_df, on="OrderID", suffixes=("_main", "_ref"))

# Accuracy check for 'Amount' column
merged_df["Amount_match"] = merged_df["Amount_main"] == merged_df["Amount_ref"]

# Calculate accuracy score as the fraction of matching values (0 to 1)
accuracy_score = merged_df["Amount_match"].mean()

# Display comparison and result
print("Comparison Data:\n", merged_df[["OrderID", "Amount_main", "Amount_ref", "Amount_match"]])
print(f"\nAccuracy Score for 'Amount': {accuracy_score:.2f}")

Comparison Data:
    OrderID  Amount_main  Amount_ref  Amount_match
0      101          250         250          True
1      102          300         300          True
2      103          275         280         False
3      104          400         400          True
4      105          350         340         False

Accuracy Score for 'Amount': 0.60
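
The comparison above checks a single column and relies on the default inner merge, so an order missing from either dataset would be dropped without lowering the score. The sketch below is one possible extension, not the exercise's required solution: it uses an outer merge with `indicator=True`, counts unmatched orders as mismatches, and scores several columns at once. The helper name `column_accuracy` is hypothetical.

```python
import pandas as pd
from io import StringIO

main_data = """OrderID,Customer,Amount
101,John,250
102,Alice,300
103,Bob,275
104,David,400
105,Eve,350
"""

reference_data = """OrderID,Customer,Amount
101,John,250
102,Alice,300
103,Bob,280
104,David,400
105,Eve,340
"""

main_df = pd.read_csv(StringIO(main_data))
ref_df = pd.read_csv(StringIO(reference_data))


def column_accuracy(main, ref, key, columns):
    """Hypothetical helper: per-column match fraction plus unmatched row count."""
    merged = pd.merge(
        main, ref, on=key, how="outer",
        suffixes=("_main", "_ref"), indicator=True,
    )
    # Rows found in only one dataset are treated as mismatches (an assumption)
    in_both = merged["_merge"] == "both"
    scores = {}
    for col in columns:
        matches = (merged[f"{col}_main"] == merged[f"{col}_ref"]) & in_both
        scores[col] = matches.mean()
    return pd.Series(scores), int((~in_both).sum())


scores, unmatched = column_accuracy(main_df, ref_df, "OrderID", ["Customer", "Amount"])

print("Accuracy per column:")
print(scores)
print(f"\nOrders present in only one dataset: {unmatched}")
```

On this sample data every order appears in both datasets, so the `Amount` score is unchanged; the difference only shows up when the two datasets cover different orders.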


### Task 3: Consistency Score

1. Objective: Evaluate the consistency within a dataset for specific columns.
2. Steps:
    - Choose a column expected to have consistent values.
    - Use statistical or rule-based checks to identify inconsistencies.
    - Calculate the consistency score as the ratio of consistent entries to total entries.
    - E.g., validating phone number formats in a contact list.

In [3]:
# Write your code from here
import pandas as pd
import re

# Sample dataset with phone numbers (some inconsistent)
data = {
    "Name": ["Alice", "Bob", "Charlie", "David", "Eve"],
    "Phone": ["123-456-7890", "1234567890", "123-456-7890", "123.456.7890", "123-45-6789"]
}

df = pd.DataFrame(data)

# Define a regex pattern for consistent format: XXX-XXX-XXXX
pattern = re.compile(r"^\d{3}-\d{3}-\d{4}$")

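# Check consistency: fullmatch requires the whole string to fit the pattern,
# and the isinstance guard marks missing (non-string) values as inconsistent
df["Consistent"] = df["Phone"].apply(lambda x: isinstance(x, str) and bool(pattern.fullmatch(x)))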

# Calculate consistency score
consistency_score = df["Consistent"].mean()

# Output
print("Phone Number Consistency Check:\n", df[["Name", "Phone", "Consistent"]])
print(f"\nConsistency Score: {consistency_score:.2f}")

Phone Number Consistency Check:
       Name         Phone  Consistent
0    Alice  123-456-7890        True
1      Bob    1234567890       False
2  Charlie  123-456-7890        True
3    David  123.456.7890       False
4      Eve   123-45-6789       False

Consistency Score: 0.40
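
The steps mention statistical checks as an alternative to a fixed rule. One way to sketch that idea, assuming the most common format should be treated as the standard (an assumption, not something the exercise states), is to reduce each value to its format "shape" and count how often each shape occurs.

```python
import pandas as pd

phones = pd.Series(
    ["123-456-7890", "1234567890", "123-456-7890", "123.456.7890", "123-45-6789"],
    name="Phone",
)

# Replace every digit with "d" so values that share a format share a shape
shapes = phones.str.replace(r"\d", "d", regex=True)

# Frequency of each shape; the most common one is taken as the expected format
shape_counts = shapes.value_counts()
dominant_shape = shape_counts.idxmax()
dominant_share = shape_counts.max() / len(phones)

print("Format shapes:")
print(shape_counts)
print(f"\nDominant shape: {dominant_shape}")
print(f"Share of entries in the dominant shape: {dominant_share:.2f}")
```

Here the dominant shape is the same `XXX-XXX-XXXX` format that the regex enforces, so both approaches agree. With ties or very small samples, the explicit rule-based pattern is the more dependable choice.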
