## Calculate Data Quality Score
**Introduction**: In this activity, you will calculate data quality scores for datasets using three metrics: completeness, accuracy, and consistency. Each task below works through one metric with an example.

### Task 1: Completeness Score
1. Objective: Determine the percentage of non-missing values in a dataset.
2. Steps:
    - Load a sample dataset using Pandas.
    - Identify the columns with missing values.
    - Calculate the completeness score as the ratio of non-missing values to total values.
    - E.g., a dataset with customer information.

In [1]:
# Write your code from here
import pandas as pd

def load_data(csv_path):
    try:
        df = pd.read_csv(csv_path)
        print(f"Data loaded successfully with {len(df)} rows.")
        return df
    except Exception as e:
        print(f"Error loading CSV: {e}")
        return None

def completeness(df):
    """Percentage of non-null values per column"""
    comp = 100 * df.notnull().mean()
    return comp.to_dict()

def validity_email(df):
    """Percentage of Email entries containing '@'"""
    if 'Email' not in df.columns:
        print("No 'Email' column found.")
        return None
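    # Cast to str so the .str accessor also works when the column is not text
    # (e.g. an all-empty column that pandas read as float)
    valid_emails = df['Email'].dropna().astype(str).str.contains('@', regex=False)
    validity = 100 * valid_emails.mean() if len(valid_emails) > 0 else 0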
    return validity

def uniqueness_email(df):
    """Count of distinct Email entries"""
    if 'Email' not in df.columns:
        print("No 'Email' column found.")
        return None
    unique_count = df['Email'].nunique(dropna=True)
    return unique_count

# === Example usage ===

csv_path = 'your_dataset.csv'  # Replace with your CSV path
df = load_data(csv_path)

if df is not None:
    comp = completeness(df)
    val = validity_email(df)
    uniq = uniqueness_email(df)

    print("\nCompleteness (%) per column:")
    for col, pct in comp.items():
        print(f"{col}: {pct:.2f}%")

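    # Both helpers return None when there is no 'Email' column,
    # and formatting None with :.2f would raise a TypeError
    if val is not None:
        print(f"\nValidity (%) of Email column containing '@': {val:.2f}%")
    if uniq is not None:
        print(f"Uniqueness (distinct count) in Email column: {uniq}")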


Data loaded successfully with 10 rows.

Completeness (%) per column:
Name: 100.00%
Email: 80.00%
Age: 90.00%

Validity (%) of Email column containing '@': 75.00%
Uniqueness (distinct count) in Email column: 8
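
The objective asks for a single completeness figure for the dataset, while the cell above reports one value per column. A minimal sketch of both, assuming a small in-memory DataFrame (the `customers` data below is hypothetical and is not the CSV used above):

```python
import pandas as pd

# Hypothetical sample data; None marks a missing value.
customers = pd.DataFrame({
    "Name": ["Ana", "Ben", "Cy", "Di"],
    "Email": ["ana@example.com", None, "cy@example.com", None],
    "Age": [34, 28, None, 41],
})

# Per-column completeness: share of non-null cells in each column.
per_column = customers.notna().mean() * 100

# Overall completeness: non-null cells divided by all cells in the table.
overall = customers.notna().to_numpy().mean() * 100

print(per_column.round(2).to_dict())
print(f"Overall completeness: {overall:.2f}%")
```

Averaging over all cells weights every value equally. Averaging the per-column percentages gives the same result here only because every column has the same number of rows.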


### Task 2: Accuracy Score

1. Objective: Measure the accuracy of a dataset by comparing it against a reference dataset.
2. Steps:
    - Load the main dataset and a reference dataset.
    - Select key columns for accuracy check.
    - Match values from both datasets and calculate the accuracy percentage.
    - E.g., an existing dataset with sales information.

In [2]:
# Write your code from here
import pandas as pd

def load_data(path):
    try:
        df = pd.read_csv(path)
        print(f"Loaded dataset '{path}' with {len(df)} rows.")
        return df
    except Exception as e:
        print(f"Error loading {path}: {e}")
        return None

def calculate_accuracy(main_df, ref_df, key_columns):
    """
    Compares main_df and ref_df on key_columns and calculates
    the percentage of matching rows.
    """

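    # Deduplicate the reference keys first: a repeated reference row would
    # duplicate rows of main_df in the merge and inflate the match count
    ref_keys = ref_df[key_columns].drop_duplicates()

    # Merge on key columns with indicator to find matches
    merged = main_df.merge(ref_keys, on=key_columns, how='left', indicator=True)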

    # Rows with _merge == 'both' exist in both datasets (matched)
    matched_rows = merged[merged['_merge'] == 'both'].shape[0]
    total_rows = main_df.shape[0]

    accuracy = 100 * matched_rows / total_rows if total_rows > 0 else 0
    return accuracy

# === Usage example ===

# Load datasets (replace with your file paths)
main_data = load_data('main_dataset.csv')
reference_data = load_data('reference_dataset.csv')

if main_data is not None and reference_data is not None:
    # Specify columns to check for accuracy
    keys = ['SaleID', 'ProductID', 'SaleAmount']  # example columns
    
    # Calculate accuracy
    acc = calculate_accuracy(main_data, reference_data, keys)
    print(f"Accuracy based on matching key columns: {acc:.2f}%")
else:
    print("Failed to load datasets for accuracy calculation.")



Loaded dataset 'main_dataset.csv' with 5 rows.
Loaded dataset 'reference_dataset.csv' with 5 rows.
Accuracy based on matching key columns: 60.00%
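
The cell above counts a row as accurate only when every key column matches at once. An alternative sketch, assuming `SaleID` uniquely identifies a sale in the reference data, aligns the two datasets on that identifier and checks a single field. The data and the `_ref` suffix below are hypothetical choices for illustration:

```python
import pandas as pd

# Hypothetical sample data, not the CSV files used above.
main = pd.DataFrame({
    "SaleID": [1, 2, 3, 4],
    "SaleAmount": [100.0, 250.0, 80.0, 40.0],
})
reference = pd.DataFrame({
    "SaleID": [1, 2, 3, 5],
    "SaleAmount": [100.0, 200.0, 80.0, 60.0],
})

# Align on SaleID (assumed unique in reference). A main row with no
# reference match gets NaN in SaleAmount_ref.
aligned = main.merge(reference, on="SaleID", how="left", suffixes=("", "_ref"))

# A value counts as accurate only if a reference value exists and agrees.
# Series.eq treats NaN as unequal, so unmatched rows count as inaccurate.
is_accurate = aligned["SaleAmount"].eq(aligned["SaleAmount_ref"])
accuracy = 100 * is_accurate.mean()

print(f"SaleAmount accuracy: {accuracy:.2f}%")
```

Checking one field at a time makes it easier to see which column is responsible for a low score. For floating-point amounts, a tolerance-based comparison may be more appropriate than exact equality.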


### Task 3: Consistency Score

1. Objective: Evaluate the consistency within a dataset for specific columns.
2. Steps:
    - Choose a column expected to have consistent values.
    - Use statistical or rule-based checks to identify inconsistencies.
    - Calculate the consistency score as the ratio of consistent entries to total entries.
    - E.g., validating phone number formats in a contact list.

In [3]:
# Write your code from here
import pandas as pd
import re

def load_data(csv_path):
    try:
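        # Read every column as text: a numeric phone column would otherwise
        # lose leading zeros or become a float such as 9876543210.0
        df = pd.read_csv(csv_path, dtype=str)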
        print(f"Loaded {len(df)} rows from {csv_path}")
        return df
    except Exception as e:
        print(f"Error loading CSV: {e}")
        return None

def is_valid_phone(phone):
    """
    Validates phone numbers with a simple regex pattern:
    Format example: +CountryCode followed by 10 digits, or just 10 digits.
    Modify the regex according to your consistency rules.
    """
    if pd.isna(phone):
        return False
    pattern = re.compile(r'^(\+\d{1,3})?[\s\-]?\d{10}$')  
    return bool(pattern.match(str(phone).strip()))

def calculate_consistency(df, column):
    if column not in df.columns:
        print(f"Column '{column}' not found in dataset.")
        return None

    total = len(df)
    consistent = df[column].apply(is_valid_phone).sum()
    consistency_score = 100 * consistent / total if total > 0 else 0
    return consistency_score

# === Example usage ===
csv_file = 'contacts.csv'  # Replace with your file path
df = load_data(csv_file)

if df is not None:
    col = 'PhoneNumber'  # Replace with your column name
    score = calculate_consistency(df, col)
    if score is not None:
        print(f"Consistency Score for '{col}': {score:.2f}%")
else:
    print("Failed to load data.")



Loaded 10 rows from contacts.csv
Consistency Score for 'PhoneNumber': 50.00%
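
To see which formats the rule accepts, the same regular expression can be run on a few hand-written values. This is a small sketch with hypothetical samples (none are taken from `contacts.csv`); the last sample mimics a numeric phone number that was stored as a float and converted back to text:

```python
import re

# Same pattern as in is_valid_phone above: optional +CountryCode,
# optional space or hyphen, then exactly 10 digits.
PHONE_PATTERN = re.compile(r'^(\+\d{1,3})?[\s\-]?\d{10}$')

# Hypothetical sample values.
samples = [
    "9876543210",      # plain 10 digits
    "+91 9876543210",  # country code, space separator
    "+1-2025550123",   # country code, hyphen separator
    "98765 43210",     # space inside the 10 digits
    "12345",           # too short
    "9876543210.0",    # number that was parsed as a float
]

results = {s: bool(PHONE_PATTERN.match(s)) for s in samples}
for value, ok in results.items():
    print(f"{value!r}: {'consistent' if ok else 'inconsistent'}")

score = 100 * sum(results.values()) / len(results)
print(f"Consistency score: {score:.2f}%")
```

Only the first three samples satisfy the pattern. If grouped digits such as `98765 43210` should count as consistent in your data, the pattern needs to be relaxed accordingly.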
