### Task 1: Understanding and Defining Data Quality Metrics
**Description**: Learn how to define basic data quality metrics such as completeness, validity, and uniqueness for a simple dataset.

**Steps**:
1. Dataset: Use a CSV with columns such as `Name`, `Email`, and `Age`.
2. Metric Definitions:
    - Completeness: Percentage of non-null values in each column.
    - Validity: Percentage of `Email` values that contain `@`.
    - Uniqueness: Count of distinct entries in the `Email` column (divide by the number of non-null emails to express it as a percentage).

In [None]:
# Write your code from here
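
One possible starting point for the metric definitions above, as a minimal sketch using only the standard library. The inline `SAMPLE_CSV` data and the helper names (`load_rows`, `completeness`, `email_validity`, `email_uniqueness`) are assumptions made for illustration; a real run would read the exercise's own CSV file instead.

```python
import csv
import io

# Hypothetical sample data standing in for the CSV described in the steps.
SAMPLE_CSV = """Name,Email,Age
Alice,alice@example.com,25
Bob,bob-at-example.com,30
,carol@example.com,
Dave,alice@example.com,41
"""

def load_rows(text):
    """Parse CSV text into a list of dicts, treating empty strings as missing."""
    reader = csv.DictReader(io.StringIO(text))
    return [{k: (v if v != "" else None) for k, v in row.items()} for row in reader]

def completeness(rows, column):
    """Percentage of rows where the column is not missing."""
    if not rows:
        return 0.0
    return 100.0 * sum(r[column] is not None for r in rows) / len(rows)

def email_validity(rows):
    """Percentage of non-missing Email values that contain '@'."""
    emails = [r["Email"] for r in rows if r["Email"] is not None]
    if not emails:
        return 0.0
    return 100.0 * sum("@" in e for e in emails) / len(emails)

def email_uniqueness(rows):
    """Number of distinct non-missing Email values."""
    return len({r["Email"] for r in rows if r["Email"] is not None})

rows = load_rows(SAMPLE_CSV)
for col in ("Name", "Email", "Age"):
    print(f"{col} completeness: {completeness(rows, col):.1f}%")
print(f"Email validity: {email_validity(rows):.1f}%")
print(f"Distinct emails: {email_uniqueness(rows)}")
```

The `@` check is deliberately as loose as the task's definition; a stricter format check would lower the validity figure for the same data.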

### Task 2: Calculating Data Quality Score
**Description**: Aggregate multiple metrics to calculate an overall data quality score.

**Steps**:
1. Formula: Simple average of all metrics defined in Task 1.

In [None]:
# Write your code from here
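
The simple-average formula could be sketched as follows. This assumes every metric is already expressed as a percentage on a 0-100 scale (so uniqueness would be distinct emails divided by non-null emails, not a raw count); the function name `quality_score` and the example values are hypothetical.

```python
from statistics import mean

def quality_score(metrics):
    """Simple (unweighted) average of metric values, each on a 0-100 scale."""
    if not metrics:
        raise ValueError("at least one metric is required")
    for name, value in metrics.items():
        if not 0.0 <= value <= 100.0:
            raise ValueError(f"{name} must be a percentage, got {value}")
    return mean(list(metrics.values()))

# Hypothetical metric values, for illustration only.
example_metrics = {"completeness": 90.0, "validity": 80.0, "uniqueness": 100.0}
print(f"Overall score: {quality_score(example_metrics):.2f}%")
```

An unweighted mean treats every metric as equally important; if one metric matters more for a given dataset, a weighted average would be a natural variation.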

### Task 3: Creating Expectations for a CSV
**Description**: Develop basic data quality expectations using Great Expectations.

**Steps**:
1. Create an Expectation Suite.
2. Define expectations for completeness (for example, that a column contains no null values).

In [None]:
# Write your code from here
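
The Great Expectations API differs noticeably between releases, so the sketch below does not call the library. It is a plain-Python stand-in that only illustrates the idea of a suite as a named collection of per-column rules. The classes `Expectation` and `ExpectationSuite` and the helper `build_completeness_suite` are hypothetical; the string `expect_column_values_to_not_be_null` mirrors the name of the library's not-null expectation.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Expectation:
    """One rule about one column (hypothetical stand-in, not the library class)."""
    expectation_type: str
    column: str

@dataclass
class ExpectationSuite:
    """A named collection of expectations (hypothetical stand-in)."""
    name: str
    expectations: list = field(default_factory=list)

    def add(self, expectation):
        """Append an expectation and return the suite to allow chaining."""
        self.expectations.append(expectation)
        return self

def build_completeness_suite(columns, name="completeness_suite"):
    """Create a suite with one not-null expectation per column."""
    suite = ExpectationSuite(name=name)
    for column in columns:
        suite.add(Expectation("expect_column_values_to_not_be_null", column))
    return suite

suite = build_completeness_suite(["Name", "Email", "Age"])
for exp in suite.expectations:
    print(f"{suite.name}: {exp.expectation_type}({exp.column!r})")
```

With the real library, the same structure would be built through its own suite and expectation objects for the installed version.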

### Task 4: Running and Validating Expectations
**Description**: Run the created expectations and generate an output report.

**Steps**:
1. Validate the data against the Expectation Suite.
2. Generate an HTML report of the validation results.

In [None]:
# Write your code from here
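
The validate-then-report flow can be illustrated without the library, as a rough sketch. This is a plain-Python stand-in rather than Great Expectations' own validation or Data Docs machinery; the function names `validate_not_null` and `render_report` and the sample rows are assumptions for illustration.

```python
import html

def validate_not_null(rows, columns):
    """Check each column for missing values; return one result dict per column."""
    results = []
    for column in columns:
        missing = sum(1 for row in rows if row.get(column) is None)
        results.append({"column": column, "missing": missing, "success": missing == 0})
    return results

def render_report(results):
    """Render validation results as a small standalone HTML page."""
    items = "".join(
        f"<li>{html.escape(r['column'])}: "
        f"{'PASS' if r['success'] else 'FAIL'} ({r['missing']} missing)</li>"
        for r in results
    )
    return f"<html><body><h1>Validation Report</h1><ul>{items}</ul></body></html>"

# Hypothetical rows, for illustration only.
rows = [
    {"Name": "Alice", "Email": "alice@example.com", "Age": 25},
    {"Name": None, "Email": "bob@example.com", "Age": 30},
]
results = validate_not_null(rows, ["Name", "Email", "Age"])
report = render_report(results)
print(report)
```

Writing `report` to a `.html` file with `open(..., "w")` would produce a page that can be opened in a browser.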


### Task 5: Automating Data Quality Score Calculation
**Description**: Automate the data quality score via a script that integrates with Great
Expectations.

In [2]:
# Write your code from here
# -y skips the interactive confirmation prompt, which otherwise stalls the cell until it is interrupted
! pip uninstall -y great_expectations
! pip install "great_expectations<3.0"


### Task 6: Leveraging Data Quality Metrics for Automated Data Cleaning
**Description**: Implement a system where if data quality metrics fall below a threshold,
automated data cleaning scripts are triggered.

**Steps**:
1. Define Cleaning Logic
2. Integrate with Great Expectations:
    - Use an action in the Great Expectations action list that triggers only when the quality score falls below a threshold, so that cleaning runs automatically.

In [6]:
# Write your code from here
import pandas as pd
import numpy as np

# --- Data Generation ---
def generate_sample_data():
    data = {
        "Name": ["Alice", "Bob", "Charlie", None, "Eve", "Frank", "Grace", "Heidi", "Ivan", "Judy"],
        "Email": [
            "alice@example.com", "bob@example", "charlie@example.com", "dave@example.com",
            None, "frank@example.com", "grace@example.com", "heidi@example", "ivan@example.com", "judy@example.com"
        ],
        "Age": [25, 30, None, 22, 28, 35, None, 40, 29, 31]
    }
    return pd.DataFrame(data)

# --- Validation Helpers ---
def validate_dataframe(df, required_columns):
    missing_cols = [col for col in required_columns if col not in df.columns]
    if missing_cols:
        raise ValueError(f"Missing required columns: {missing_cols}")

def validate_email_format(email):
    if not isinstance(email, str):
        return False
    return '@' in email and '.' in email.split('@')[-1]

# --- Data Quality Metrics ---
def completeness(df):
    validate_dataframe(df, df.columns)
    return df.notnull().mean() * 100

def validity_email(df):
    validate_dataframe(df, ['Email'])
    email_col = df['Email'].dropna()
    valid_emails = email_col.apply(validate_email_format)
    return valid_emails.mean() * 100 if len(email_col) > 0 else 0

def uniqueness_email(df):
    validate_dataframe(df, ['Email'])
    email_col = df['Email'].dropna()
    total = len(email_col)
    unique = email_col.nunique()
    return (unique / total) * 100 if total > 0 else 0

def calculate_quality_score(completeness_scores, validity_score, uniqueness_score):
    avg_completeness = np.mean(completeness_scores)
    return np.mean([avg_completeness, validity_score, uniqueness_score])

# --- Data Cleaning ---
def clean_data(df):
    # Work on a copy so the caller's DataFrame is not modified
    df = df.copy()
    # Fill missing names
    df['Name'] = df['Name'].fillna('Unknown')
    # Keep only rows with a valid email (missing emails count as invalid)
    valid_email_mask = df['Email'].apply(validate_email_format)
    # .copy() makes the filtered frame independent, which avoids SettingWithCopyWarning
    df = df[valid_email_mask].copy()
    # Fill missing ages with median
    median_age = df['Age'].median()
    df['Age'] = df['Age'].fillna(median_age)
    # Drop duplicate emails
    df = df.drop_duplicates(subset=['Email'])
    return df.reset_index(drop=True)

# --- Reporting ---
def generate_html_report(metrics, overall_score, filename="data_quality_report.html"):
    html = f"""
    <html><head><title>Data Quality Report</title></head><body>
    <h1>Data Quality Metrics</h1>
    <ul>
        <li>Completeness per column:
            <ul>
                {''.join([f'<li>{col}: {score:.2f}%</li>' for col, score in metrics['completeness'].items()])}
            </ul>
        </li>
        <li>Email Validity: {metrics['validity']:.2f}%</li>
        <li>Email Uniqueness: {metrics['uniqueness']:.2f}%</li>
    </ul>
    <h2>Overall Data Quality Score: {overall_score:.2f}%</h2>
    </body></html>
    """
    try:
        with open(filename, "w") as f:
            f.write(html)
        print(f"Report generated: {filename}")
    except Exception as e:
        print(f"Error writing report: {e}")

# --- Main Execution ---
def main():
    try:
        df = generate_sample_data()
        print("Initial Dataset:\n", df)

        completeness_scores = completeness(df)
        validity_score = validity_email(df)
        uniqueness_score = uniqueness_email(df)
        overall_score = calculate_quality_score(completeness_scores, validity_score, uniqueness_score)

        print("\nData Quality Metrics (Before Cleaning):")
        print(completeness_scores)
        print(f"Email Validity: {validity_score:.2f}%")
        print(f"Email Uniqueness: {uniqueness_score:.2f}%")
        print(f"Overall Score: {overall_score:.2f}%")

        threshold = 90.0
        if overall_score < threshold:
            print("\nQuality below threshold, cleaning data...")
            df = clean_data(df)

            completeness_scores = completeness(df)
            validity_score = validity_email(df)
            uniqueness_score = uniqueness_email(df)
            overall_score = calculate_quality_score(completeness_scores, validity_score, uniqueness_score)

            print("\nData Quality Metrics (After Cleaning):")
            print(completeness_scores)
            print(f"Email Validity: {validity_score:.2f}%")
            print(f"Email Uniqueness: {uniqueness_score:.2f}%")
            print(f"Overall Score: {overall_score:.2f}%")

        generate_html_report({
            "completeness": completeness_scores,
            "validity": validity_score,
            "uniqueness": uniqueness_score
        }, overall_score)

    except Exception as e:
        print(f"Error during processing: {e}")

if __name__ == "__main__":
    main()


Initial Dataset:
       Name                Email   Age
0    Alice    alice@example.com  25.0
1      Bob          bob@example  30.0
2  Charlie  charlie@example.com   NaN
3     None     dave@example.com  22.0
4      Eve                 None  28.0
5    Frank    frank@example.com  35.0
6    Grace    grace@example.com   NaN
7    Heidi        heidi@example  40.0
8     Ivan     ivan@example.com  29.0
9     Judy     judy@example.com  31.0

Data Quality Metrics (Before Cleaning):
Name     90.0
Email    90.0
Age      80.0
dtype: float64
Email Validity: 77.78%
Email Uniqueness: 100.00%
Overall Score: 88.15%

Quality below threshold, cleaning data...

Data Quality Metrics (After Cleaning):
Name     100.0
Email    100.0
Age      100.0
dtype: float64
Email Validity: 100.00%
Email Uniqueness: 100.00%
Overall Score: 100.00%
Report generated: data_quality_report.html


