### Task 1: Understanding and Defining Data Quality Metrics
**Description**: Learn how to define basic data quality metrics such as completeness, validity, and uniqueness for a simple dataset.

**Steps**:
1. Dataset: Use a CSV with columns such as `Name`, `Email`, and `Age`.
2. Metric Definitions:
    - Completeness: Percentage of non-null values in each column.
    - Validity: Percentage of non-null `Email` values containing `@`.
    - Uniqueness: Number of distinct entries in the `Email` column.

In [1]:
# Write your code from here
import pandas as pd

# Sample dataset
data = {
    'Name': ['Alice', 'Bob', 'Charlie', None, 'Eve'],
    'Email': ['alice@example.com', 'bob@example', 'charlie@example.com', None, 'eve@example.com'],
    'Age': [25, 30, None, 22, 28]
}

df = pd.DataFrame(data)

# Completeness
completeness = df.notnull().mean() * 100

# Validity for Email (check if contains '@')
valid_emails = df['Email'].dropna().apply(lambda x: '@' in x)
validity = valid_emails.mean() * 100

# Uniqueness of Email
uniqueness = df['Email'].nunique(dropna=True)

print(f"Completeness (% non-null values):\n{completeness}\n")
print(f"Email Validity (% with '@'): {validity:.2f}%")
print(f"Email Uniqueness (distinct count): {uniqueness}")


Completeness (% non-null values):
Name     80.0
Email    80.0
Age      80.0
dtype: float64

Email Validity (% with '@'): 100.00%
Email Uniqueness (distinct count): 4
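
Note that `bob@example` passes the `@` check, which is why validity comes out at 100%. As a minimal sketch of a stricter rule, assuming a valid address needs a dot-separated domain after the `@`: the `EMAIL_PATTERN` regex and the `email_validity` helper are illustrative names, not part of the task, and the pattern is far from a full email validator. It uses plain Python so it can be tried without pandas.

```python
import re

# Illustrative pattern (an assumption): something@domain.tld,
# with no whitespace and exactly one '@'.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def email_validity(emails):
    """Return the percentage of non-null emails matching EMAIL_PATTERN."""
    present = [e for e in emails if e is not None]
    if not present:
        return 0.0
    valid = sum(1 for e in present if EMAIL_PATTERN.match(e))
    return valid / len(present) * 100

# Same Email values as the sample dataset
emails = ['alice@example.com', 'bob@example', 'charlie@example.com', None, 'eve@example.com']
print(f"Email Validity (stricter pattern): {email_validity(emails):.2f}%")
```

With this pattern, `bob@example` no longer counts as valid, so three of the four non-null addresses pass.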


### Task 2: Calculating Data Quality Score
**Description**: Aggregate multiple metrics to calculate an overall data quality score.

**Steps**:
1. Formula: Simple average of all metrics defined in Task 1.

In [2]:

import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie', None, 'Eve'],
    'Email': ['alice@example.com', 'bob@example', 'charlie@example.com', None, 'eve@example.com'],
    'Age': [25, 30, None, 22, 28]
}

df = pd.DataFrame(data)

# Completeness for Email column
completeness_email = df['Email'].notnull().mean() * 100

# Validity for Email (contain '@')
valid_emails = df['Email'].dropna().apply(lambda x: '@' in x)
validity_email = valid_emails.mean() * 100

# Uniqueness for Email
unique_emails = df['Email'].nunique(dropna=True)
total_non_null_emails = df['Email'].notnull().sum()
normalized_uniqueness = (unique_emails / total_non_null_emails) * 100 if total_non_null_emails > 0 else 0

# Data Quality Score (average)
data_quality_score = (completeness_email + validity_email + normalized_uniqueness) / 3

print(f"Completeness (Email): {completeness_email:.2f}%")
print(f"Validity (Email): {validity_email:.2f}%")
print(f"Uniqueness (Email): {normalized_uniqueness:.2f}%")
print(f"Overall Data Quality Score: {data_quality_score:.2f}%")


Completeness (Email): 80.00%
Validity (Email): 100.00%
Uniqueness (Email): 100.00%
Overall Data Quality Score: 93.33%
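
The score above is the arithmetic mean of the three Email metrics: (80 + 100 + 100) / 3 ≈ 93.33. A minimal sketch of the same formula as a reusable helper, assuming every metric is already expressed as a percentage (`average_quality_score` is a hypothetical name):

```python
def average_quality_score(metrics):
    """Simple average of metric percentages; returns 0.0 for an empty dict."""
    if not metrics:
        return 0.0
    return sum(metrics.values()) / len(metrics)

# Values taken from the cell output above
email_metrics = {"completeness": 80.0, "validity": 100.0, "uniqueness": 100.0}
print(f"Overall Data Quality Score: {average_quality_score(email_metrics):.2f}%")
```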


### Task 3: Creating Expectations for a CSV
**Description**: Develop basic data quality expectations using Great Expectations.

**Steps**:
1. Create an Expectation Suite.
2. Define expectations for completeness.

In [1]:
!pip install great_expectations


Defaulting to user installation because normal site-packages is not writeable
Collecting pandas<2.2,>=1.3.0
  Using cached pandas-2.1.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.3 MB)
Collecting numpy>=1.22.4
  Using cached numpy-1.26.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.2 MB)
Installing collected packages: numpy, pandas
  Attempting uninstall: numpy
    Found existing installation: numpy 2.1.3
    Uninstalling numpy-2.1.3:
      Successfully uninstalled numpy-2.1.3
  Attempting uninstall: pandas
    Found existing installation: pandas 2.2.3
    Uninstalling pandas-2.2.3:
      Successfully uninstalled pandas-2.2.3
Successfully installed numpy-1.26.4 pandas-2.1.4


In [2]:
!great_expectations init


/bin/bash: line 1: great_expectations: command not found


In [3]:
!great_expectations suite new


/bin/bash: line 1: great_expectations: command not found


In [7]:
import pandas as pd
import great_expectations as gx

# NOTE: this cell assumes Great Expectations 1.x, where the legacy
# `great_expectations.dataset` module is no longer available.

# Step 1: Load CSV with pandas
df = pd.read_csv('/workspaces/AI_DATA_ANALYSIS_/src/Module 8/Hands-on - Data Quality Scoring & Automation/sample_data.csv')

# Step 2: Get a Data Context (ephemeral unless a project has been configured)
context = gx.get_context()

# Step 3: Register the DataFrame with the context and fetch it as a batch
data_source = context.data_sources.add_pandas("pandas_source")
data_asset = data_source.add_dataframe_asset(name="sample_data")
batch_definition = data_asset.add_batch_definition_whole_dataframe("whole_dataframe")
batch = batch_definition.get_batch(batch_parameters={"dataframe": df})

# Step 4: Create a new expectation suite
suite_name = "basic_suite"
suite = context.suites.add(gx.ExpectationSuite(name=suite_name))

# Step 5: Add completeness expectations for columns
for column in ['Name', 'Email', 'Age']:
    suite.add_expectation(
        gx.expectations.ExpectColumnValuesToNotBeNull(column=column)
    )

# Step 6: Validate the data against the suite
results = batch.validate(suite)
print(results)

### Task 4: Running and Validating Expectations
**Description**: Run the created expectations and generate an output report.

**Steps**:
1. Validate
2. Generate HTML Report

In [1]:
# Write your code from here
from great_expectations.data_context import get_context

# Get the context
context = get_context()

# Run the checkpoint.
# NOTE: assumes Great Expectations 1.x, which has no context.run_checkpoint().
# "my_checkpoint" must already be registered in this context, and a checkpoint
# built on a DataFrame asset also needs batch_parameters={"dataframe": df} in run().
checkpoint = context.checkpoints.get("my_checkpoint")
checkpoint_result = checkpoint.run()
print(checkpoint_result.success)
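
Step 2 asks for an HTML report. Great Expectations produces these through its Data Docs feature. As a dependency-free illustration of the idea only, the sketch below renders a list of expectation outcomes as an HTML table. The `results` list, its keys, and the `render_report` helper are hypothetical stand-ins, not Great Expectations objects or output.

```python
from html import escape

def render_report(results):
    """Render expectation outcomes as a small standalone HTML page."""
    rows = "\n".join(
        "<tr><td>{}</td><td>{}</td><td>{}</td></tr>".format(
            escape(r["expectation"]),
            escape(r["column"]),
            "PASS" if r["success"] else "FAIL",
        )
        for r in results
    )
    return (
        "<html><body><h1>Validation Report</h1>\n"
        "<table border='1'>\n"
        "<tr><th>Expectation</th><th>Column</th><th>Result</th></tr>\n"
        f"{rows}\n"
        "</table></body></html>"
    )

# Hypothetical outcomes, for illustration only
results = [
    {"expectation": "expect_column_values_to_not_be_null", "column": "Name", "success": False},
    {"expectation": "expect_column_values_to_not_be_null", "column": "Age", "success": True},
]

html_report = render_report(results)
print(html_report)
```

Writing `html_report` to a `.html` file would give a page that opens in any browser.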

### Task 5: Automating Data Quality Score Calculation
**Description**: Automate the data quality score via a script that integrates with Great Expectations.

In [2]:
# Write your code from here
import json
from great_expectations.data_context import get_context

# NOTE: assumes Great Expectations 1.x, where context.run_checkpoint() no longer
# exists and a checkpoint result exposes `run_results` as result objects.

def calculate_data_quality_score(checkpoint_result):
    """
    Calculate data quality score from a checkpoint result.
    Returns score as percentage of successful expectations.
    """
    success_count = 0
    total_count = 0

    for validation_result in checkpoint_result.run_results.values():
        for expectation_result in validation_result.results:
            total_count += 1
            if expectation_result.success:
                success_count += 1

    if total_count == 0:
        return 0.0
    return (success_count / total_count) * 100

def run_checkpoint_and_score(checkpoint_name="my_checkpoint"):
    # Load context and run the checkpoint; it must already be registered in
    # this context (for example via context.checkpoints.add(...)).
    context = get_context()
    checkpoint_result = context.checkpoints.get(checkpoint_name).run()

    # Calculate data quality score
    score = calculate_data_quality_score(checkpoint_result)

    print(f"✅ Data Quality Score: {score:.2f}%")

    # Optional: save each validation result to JSON
    result_dict = {
        str(identifier): validation_result.to_json_dict()
        for identifier, validation_result in checkpoint_result.run_results.items()
    }
    with open("data_quality_result.json", "w") as f:
        json.dump(result_dict, f, indent=2)

    return score

# Run script
if __name__ == "__main__":
    run_checkpoint_and_score("my_checkpoint")
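
The scoring rule itself does not depend on Great Expectations: the score is the share of expectations that passed. A minimal sketch on hand-written outcomes, useful for checking the arithmetic without a configured checkpoint. The `outcomes` list and `score_from_outcomes` helper are hypothetical, and the dictionaries are not real validation output.

```python
def score_from_outcomes(outcomes):
    """Percentage of outcomes whose 'success' flag is True (0.0 if empty)."""
    if not outcomes:
        return 0.0
    passed = sum(1 for o in outcomes if o["success"])
    return passed / len(outcomes) * 100

# Hypothetical outcomes: two expectations pass, one fails
outcomes = [
    {"column": "Name", "success": True},
    {"column": "Email", "success": True},
    {"column": "Age", "success": False},
]
print(f"Data Quality Score: {score_from_outcomes(outcomes):.2f}%")
```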

### Task 6: Leveraging Data Quality Metrics for Automated Data Cleaning
**Description**: Implement a system where automated data cleaning scripts are triggered if data quality metrics fall below a threshold.

**Steps**:
1. Define Cleaning Logic
2. Integrate with Great Expectations:
    - Use an action within the Great Expectations action list that only triggers if quality score is below a threshold, automating the cleaning.
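
Step 1 leaves the cleaning logic unspecified. A minimal sketch, assuming records are plain dictionaries with the `Name`, `Email`, and `Age` keys from Task 1, and that cleaning means dropping rows with a missing or duplicate `Email`. The `clean_records` and `clean_if_below_threshold` helpers, and the cleaning rule itself, are illustrative assumptions.

```python
def clean_records(records):
    """Drop rows with a missing Email and keep the first row per Email."""
    seen = set()
    cleaned = []
    for row in records:
        email = row.get("Email")
        if email is None or email in seen:
            continue
        seen.add(email)
        cleaned.append(row)
    return cleaned

def clean_if_below_threshold(records, score, threshold=90.0):
    """Return cleaned records when score < threshold, else the input unchanged."""
    return clean_records(records) if score < threshold else records

# Hypothetical records: one row without an Email and one exact duplicate
records = [
    {"Name": "Alice", "Email": "alice@example.com", "Age": 25},
    {"Name": None, "Email": None, "Age": 22},
    {"Name": "Alice", "Email": "alice@example.com", "Age": 25},
]
print(clean_if_below_threshold(records, score=66.67))
```

Keeping the threshold check separate from the cleaning rule lets either one change without touching the other.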

In [4]:
# Write your code from here
import subprocess
import sys
from typing import Literal

from great_expectations.checkpoint.actions import ValidationAction

# NOTE: assumes Great Expectations 1.x, where custom actions subclass
# ValidationAction (a pydantic model) and receive the whole checkpoint result.

class TriggerCleaningAction(ValidationAction):
    type: Literal["trigger_cleaning"] = "trigger_cleaning"
    threshold: float = 90.0
    cleaning_script: str = "clean_data.py"

    def run(self, checkpoint_result, action_context=None):
        # Flatten expectation results across every validation in the run
        results = [
            r
            for validation_result in checkpoint_result.run_results.values()
            for r in validation_result.results
        ]
        total = len(results)
        passed = sum(1 for r in results if r.success)
        score = (passed / total) * 100 if total > 0 else 0

        print(f"📊 Data Quality Score: {score:.2f}%")
        if score < self.threshold:
            print("⚠️ Quality score below threshold: triggering data cleaning.")
            # Use the current interpreter rather than whatever "python" is on PATH
            subprocess.run([sys.executable, self.cleaning_script])
        else:
            print("✅ Data quality acceptable: no cleaning required.")
        return {"score": score}