### Task 1: Understanding and Defining Data Quality Metrics
**Description**: Learn how to define basic data quality metrics such as completeness, validity, and uniqueness for a simple dataset.

**Steps**:
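1. Dataset: Use a CSV with columns such as `Name`, `Email`, and `Age`.
2. Metric Definitions:
    - Completeness: Percentage of non-null values in each column.
    - Validity: Percentage of `Email` values that contain `@`.
    - Uniqueness: Number of distinct entries in the `Email` column, expressed as a percentage of its non-null entries so that it can be averaged with the other metrics.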
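
The metric definitions above can be sketched with the standard library alone. This is a minimal illustration rather than a reference solution: the inline CSV text, the helper names (`load_rows`, `completeness`, `email_validity`, `email_uniqueness`), the choice to treat empty strings as missing, and the choice to report uniqueness as a percentage of non-empty values are all assumptions made for the example.

```python
import csv
import io

# Hypothetical sample data standing in for the exercise CSV.
SAMPLE_CSV = """Name,Email,Age
Alice,alice@example.com,25
Bob,bob.example.com,30
,carol@example.com,
Dave,alice@example.com,41
"""


def load_rows(text):
    """Parse CSV text into a list of dicts, one per data row."""
    return list(csv.DictReader(io.StringIO(text)))


def _non_empty(rows, column):
    """Values of `column` that are neither missing nor blank."""
    return [row[column] for row in rows if (row.get(column) or "").strip()]


def completeness(rows, column):
    """Percentage of rows whose value in `column` is non-empty."""
    if not rows:
        return 0.0
    return 100.0 * len(_non_empty(rows, column)) / len(rows)


def email_validity(rows):
    """Percentage of non-empty Email values that contain '@'."""
    emails = _non_empty(rows, "Email")
    if not emails:
        return 0.0
    return 100.0 * sum("@" in email for email in emails) / len(emails)


def email_uniqueness(rows):
    """Distinct non-empty Email values as a percentage of non-empty ones."""
    emails = _non_empty(rows, "Email")
    if not emails:
        return 0.0
    return 100.0 * len(set(emails)) / len(emails)


rows = load_rows(SAMPLE_CSV)
for column in ("Name", "Email", "Age"):
    print(f"Completeness of {column}: {completeness(rows, column):.1f}%")
print(f"Email validity: {email_validity(rows):.1f}%")
print(f"Email uniqueness: {email_uniqueness(rows):.1f}%")
```

The same calculations can be written with pandas; the plain-Python version is used here only to keep each definition visible.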

In [None]:
# Write your code from here

### Task 2: Calculating Data Quality Score
**Description**: Aggregate multiple metrics to calculate an overall data quality score.

**Steps**:
1. Formula: Simple average of all metrics defined in Task 1.
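
As a rough sketch of the formula, the overall score is the unweighted mean of the metric values, score = (m1 + m2 + ... + mn) / n, where every metric is on the same 0-100 scale. The function name `quality_score` and the example values below are hypothetical.

```python
from statistics import mean


def quality_score(metrics):
    """Unweighted average of metric values, each given as a percentage."""
    if not metrics:
        raise ValueError("at least one metric is required")
    return mean(metrics.values())


# Hypothetical metric values, each on a 0-100 scale.
example_metrics = {"completeness": 90.0, "validity": 80.0, "uniqueness": 100.0}
score = quality_score(example_metrics)
print(f"Overall quality score: {score:.2f}%")
```

A simple average treats every metric as equally important; a weighted average is a natural variation if, for instance, completeness matters more than uniqueness.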

In [None]:
# Write your code from here

### Task 3: Creating Expectations for a CSV
**Description**: Develop basic data quality expectations using Great Expectations.

**Steps**:
1. Create an Expectation Suite.
2. Define expectations for completeness (non-null values) and add them to the suite.
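
Because the Great Expectations API differs between releases, the sketch below deliberately does not call the library. It only illustrates, in plain Python, the idea behind a completeness expectation: a named rule on a column that reports success or failure. The `NotNullExpectation` class, the dictionary used as a "suite", and the result keys are hypothetical and are not Great Expectations objects.

```python
from dataclasses import dataclass


@dataclass
class NotNullExpectation:
    """Hypothetical stand-in for a completeness rule on one column."""

    column: str
    min_percent: float = 100.0  # required share of non-null values

    def check(self, rows):
        """Evaluate the rule against a list of row dicts."""
        total = len(rows)
        non_null = sum(1 for row in rows if row.get(self.column) is not None)
        percent = 100.0 * non_null / total if total else 0.0
        return {
            "expectation": f"not_null({self.column})",
            "success": percent >= self.min_percent,
            "non_null_percent": percent,
        }


# In this sketch a "suite" is just a named list of expectations.
suite = {
    "name": "completeness_suite",
    "expectations": [
        NotNullExpectation("Name"),
        NotNullExpectation("Email"),
        NotNullExpectation("Age", min_percent=50.0),
    ],
}

# Hypothetical rows; None marks a missing value.
rows = [
    {"Name": "Alice", "Email": "alice@example.com", "Age": 25},
    {"Name": None, "Email": "bob@example.com", "Age": None},
]

results = [expectation.check(rows) for expectation in suite["expectations"]]
for result in results:
    print(result)
```

When working with the real library, the equivalent rule and suite objects should be taken from the documentation of the installed version.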

In [None]:
# Write your code from here

### Task 4: Running and Validating Expectations
**Description**: Run the created expectations and generate an output report.

**Steps**:
1. Validate the dataset against the Expectation Suite.
2. Generate an HTML report of the validation results.
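
The two steps can be illustrated without the library: run each check, collect the results, and render them as HTML. This is a library-free sketch under stated assumptions; `validate`, `render_report`, the result keys, and the output file name are hypothetical, and the report is written to the system temporary directory so the example leaves the working directory untouched.

```python
import tempfile
from html import escape
from pathlib import Path


def validate(rows, required_columns):
    """Check each required column for missing values; one result per column."""
    results = []
    for column in required_columns:
        missing = sum(1 for row in rows if row.get(column) is None)
        results.append(
            {"check": f"{column} is not null", "success": missing == 0, "missing": missing}
        )
    return results


def render_report(title, results):
    """Build a minimal HTML page with one table row per check."""
    table_rows = "\n".join(
        "<tr><td>{}</td><td>{}</td><td>{}</td></tr>".format(
            escape(r["check"]), "PASS" if r["success"] else "FAIL", r["missing"]
        )
        for r in results
    )
    passed = sum(1 for r in results if r["success"])
    return (
        f"<html><head><title>{escape(title)}</title></head><body>\n"
        f"<h1>{escape(title)}</h1>\n"
        f"<p>{passed} of {len(results)} checks passed</p>\n"
        "<table>\n<tr><th>Check</th><th>Status</th><th>Missing</th></tr>\n"
        f"{table_rows}\n</table>\n</body></html>\n"
    )


# Hypothetical rows; None marks a missing value.
rows = [
    {"Name": "Alice", "Email": "alice@example.com"},
    {"Name": None, "Email": "bob@example.com"},
]

results = validate(rows, ["Name", "Email"])
report = render_report("Validation Report", results)

report_path = Path(tempfile.gettempdir()) / "validation_report.html"
report_path.write_text(report, encoding="utf-8")
print(f"Report written to {report_path}")
```

Escaping the text with `html.escape` keeps column names or titles that contain `<` or `&` from breaking the page.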

In [None]:
# Write your code from here


### Task 5: Automating Data Quality Score Calculation
**Description**: Automate the data quality score via a script that integrates with Great
Expectations.
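
One way to automate the score, sketched here under assumptions, is to derive it from the share of expectations that passed and turn it into an exit code that a scheduler or CI job can act on. The input is assumed to be a list of dictionaries with a boolean `success` key; that shape, the function names, and the 90% default threshold are hypothetical and would need to be adapted to the validation results produced by the installed Great Expectations version.

```python
def score_from_results(results):
    """Percentage of results whose 'success' flag is true."""
    if not results:
        return 0.0
    passed = sum(1 for result in results if result["success"])
    return 100.0 * passed / len(results)


def run(results, threshold=90.0):
    """Print the score and return an exit code (0 = pass, 1 = below threshold)."""
    score = score_from_results(results)
    print(f"Data quality score: {score:.2f}% (threshold {threshold:.2f}%)")
    return 0 if score >= threshold else 1


# Hypothetical results; in practice these would come from a validation run.
sample_results = [{"success": True}, {"success": True}, {"success": False}]
exit_code = run(sample_results)
# In a standalone script, the exit code would be passed to sys.exit(exit_code).
```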

In [2]:
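# Write your code from here
# Note: without the -y flag, `pip uninstall` waits for confirmation (see the prompt in the output).
! pip uninstall great_expectations
! pip install "great_expectations<3.0"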


Found existing installation: great-expectations 1.4.4
Uninstalling great-expectations-1.4.4:
  Would remove:
    /home/vscode/.local/lib/python3.10/site-packages/great_expectations-1.4.4.dist-info/*
    /home/vscode/.local/lib/python3.10/site-packages/great_expectations/*
Proceed (Y/n)? ERROR: Operation cancelled by user
^C
Defaulting to user installation because normal site-packages is not writeable

[notice] A new release of pip is available: 23.0.1 -> 25.1.1
[notice] To update, run: pip install --upgrade pip


### Task 6: Leveraging Data Quality Metrics for Automated Data Cleaning
**Description**: Implement a system where if data quality metrics fall below a threshold,
automated data cleaning scripts are triggered.

**Steps**:
1. Define Cleaning Logic
2. Integrate with Great Expectations:
    - Add an action to the Great Expectations action list that runs the cleaning logic only when the quality score falls below a threshold.
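
The threshold gate itself is independent of any library and can be sketched as a small wrapper that calls a cleaning function only when the score is too low. This does not use the Great Expectations action API; `clean_if_below` and `drop_rows_without_valid_email` are hypothetical names, and the scores passed in are made-up values.

```python
def drop_rows_without_valid_email(rows):
    """Hypothetical cleaning step: keep rows whose Email is a string with '@'."""
    return [
        row for row in rows
        if isinstance(row.get("Email"), str) and "@" in row["Email"]
    ]


def clean_if_below(rows, score, threshold, cleaner):
    """Run `cleaner` only when `score` is below `threshold`.

    Returns the (possibly cleaned) rows and a flag saying whether cleaning ran.
    """
    if score < threshold:
        return cleaner(rows), True
    return rows, False


# Hypothetical rows; None marks a missing value.
rows = [
    {"Name": "Alice", "Email": "alice@example.com"},
    {"Name": "Bob", "Email": "bob.example.com"},
    {"Name": "Eve", "Email": None},
]

# Score below the threshold: the cleaner runs.
cleaned, cleaning_ran = clean_if_below(
    rows, score=66.7, threshold=90.0, cleaner=drop_rows_without_valid_email
)
print(cleaning_ran, cleaned)

# Score at or above the threshold: the rows are returned unchanged.
unchanged, cleaning_ran_again = clean_if_below(
    rows, score=95.0, threshold=90.0, cleaner=drop_rows_without_valid_email
)
print(cleaning_ran_again, len(unchanged))
```

Passing the cleaner as an argument keeps the gate reusable: different datasets can supply different cleaning functions without changing the threshold logic.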

In [5]:
# Write your code from here
import pandas as pd
import numpy as np

# Step 1: Generate sample dataset
def generate_sample_data():
    data = {
        "Name": ["Alice", "Bob", "Charlie", None, "Eve", "Frank", "Grace", "Heidi", "Ivan", "Judy"],
        "Email": [
            "alice@example.com", "bob@example", "charlie@example.com", "dave@example.com",
            None, "frank@example.com", "grace@example.com", "heidi@example", "ivan@example.com", "judy@example.com"
        ],
        "Age": [25, 30, None, 22, 28, 35, None, 40, 29, 31]
    }
    df = pd.DataFrame(data)
    return df

# Step 2: Calculate completeness (% non-null per column)
def completeness(df):
    return df.notnull().mean() * 100

# Step 3: Calculate validity (% of non-null emails containing '@')
def validity_email(df):
    email_col = df['Email'].dropna()
    valid_emails = email_col.apply(lambda x: '@' in x)
    return valid_emails.mean() * 100

# Step 4: Calculate uniqueness (% unique among non-null emails)
def uniqueness_email(df):
    email_col = df['Email'].dropna()
    unique_count = email_col.nunique()
    total_count = len(email_col)
    return (unique_count / total_count) * 100 if total_count > 0 else 0

# Step 5: Calculate overall data quality score (average of metrics)
def calculate_quality_score(completeness_scores, validity_score, uniqueness_score):
    avg_completeness = np.mean(completeness_scores)
    overall = np.mean([avg_completeness, validity_score, uniqueness_score])
    return overall

# Step 6: Automated cleaning if quality is below threshold
def clean_data(df):
    # Work on a copy so the caller's DataFrame is left unmodified
    df = df.copy()
    # Fill missing names with 'Unknown'
    df['Name'] = df['Name'].fillna('Unknown')
    # Keep only rows whose email is a string containing '@'
    has_valid_email = df['Email'].apply(lambda x: isinstance(x, str) and '@' in x)
    df = df[has_valid_email].copy()
    # Fill missing ages with the median age of the remaining rows
    df['Age'] = df['Age'].fillna(df['Age'].median())
    # Drop duplicate emails, keeping the first occurrence
    df = df.drop_duplicates(subset=['Email'])
    return df.reset_index(drop=True)

# Step 7: Generate a simple HTML report
def generate_html_report(metrics, overall_score, filename="data_quality_report.html"):
    html = f"""
    <html>
    <head><title>Data Quality Report</title></head>
    <body>
        <h1>Data Quality Metrics</h1>
        <ul>
            <li>Completeness per column:
                <ul>
                    <li>Name: {metrics['completeness']['Name']:.2f}%</li>
                    <li>Email: {metrics['completeness']['Email']:.2f}%</li>
                    <li>Age: {metrics['completeness']['Age']:.2f}%</li>
                </ul>
            </li>
            <li>Email Validity: {metrics['validity']:.2f}%</li>
            <li>Email Uniqueness: {metrics['uniqueness']:.2f}%</li>
        </ul>
        <h2>Overall Data Quality Score: {overall_score:.2f}%</h2>
    </body>
    </html>
    """
    with open(filename, "w", encoding="utf-8") as f:
        f.write(html)
    print(f"Data quality report generated: {filename}")

# Main execution
if __name__ == "__main__":
    df = generate_sample_data()
    print("Initial Dataset:")
    print(df)

    # Calculate metrics
    completeness_scores = completeness(df)
    validity_score = validity_email(df)
    uniqueness_score = uniqueness_email(df)
    overall_score = calculate_quality_score(completeness_scores, validity_score, uniqueness_score)

    print("\nData Quality Metrics (Before Cleaning):")
    print(f"Completeness:\n{completeness_scores}")
    print(f"Email Validity: {validity_score:.2f}%")
    print(f"Email Uniqueness: {uniqueness_score:.2f}%")
    print(f"Overall Quality Score: {overall_score:.2f}%")

    # Threshold for cleaning
    threshold = 90.0
    if overall_score < threshold:
        print("\nData quality below threshold. Running cleaning...")
        df_cleaned = clean_data(df)
        # Recalculate metrics after cleaning
        completeness_scores = completeness(df_cleaned)
        validity_score = validity_email(df_cleaned)
        uniqueness_score = uniqueness_email(df_cleaned)
        overall_score = calculate_quality_score(completeness_scores, validity_score, uniqueness_score)

        print("\nData Quality Metrics (After Cleaning):")
        print(f"Completeness:\n{completeness_scores}")
        print(f"Email Validity: {validity_score:.2f}%")
        print(f"Email Uniqueness: {uniqueness_score:.2f}%")
        print(f"Overall Quality Score: {overall_score:.2f}%")
    else:
        df_cleaned = df

    # Generate HTML report
    metrics = {
        "completeness": completeness_scores,
        "validity": validity_score,
        "uniqueness": uniqueness_score
    }
    generate_html_report(metrics, overall_score)


Initial Dataset:
      Name                Email   Age
0    Alice    alice@example.com  25.0
1      Bob          bob@example  30.0
2  Charlie  charlie@example.com   NaN
3     None     dave@example.com  22.0
4      Eve                 None  28.0
5    Frank    frank@example.com  35.0
6    Grace    grace@example.com   NaN
7    Heidi        heidi@example  40.0
8     Ivan     ivan@example.com  29.0
9     Judy     judy@example.com  31.0

Data Quality Metrics (Before Cleaning):
Completeness:
Name     90.0
Email    90.0
Age      80.0
dtype: float64
Email Validity: 100.00%
Email Uniqueness: 100.00%
Overall Quality Score: 95.56%
Data quality report generated: data_quality_report.html
