### Task 1: Understanding and Defining Data Quality Metrics
**Description**: Learn how to define basic data quality metrics such as completeness, validity, and uniqueness for a simple dataset.

**Steps**:
1. Dataset: Use a CSV with columns like `Name`, `Email`, `Age`.
2. Metric Definitions:
    - Completeness: Percentage of non-null values.
    - Validity: Percentage of email fields containing `@`.
    - Uniqueness: Count distinct entries in the Email column.
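
To make the three definitions concrete, here is a minimal sketch on a small in-memory table, so it runs without a CSV file. The sample rows and the variable names are made up for illustration, and the `@` check is only the crude validity rule described above, not real email validation.

```python
import pandas as pd

# Hypothetical sample data standing in for the CSV
df = pd.DataFrame({
    "Name": ["Ana", "Ben", None, "Dee"],
    "Email": ["ana@example.com", "ben.example.com", "ana@example.com", None],
    "Age": [34, 28, 45, 51],
})

# Completeness: percentage of non-null values per column
completeness = df.notnull().mean() * 100

# Validity: percentage of non-null emails that contain '@'
emails = df["Email"].dropna().astype(str)
validity = emails.str.contains("@", regex=False).mean() * 100

# Uniqueness: number of distinct non-null emails
unique_emails = df["Email"].nunique()

print(completeness)
print(f"Validity: {validity:.2f}%")
print(f"Distinct emails: {unique_emails}")
```

In this sample, one of four names and one of four emails are missing, one of the three remaining emails lacks an `@`, and two emails are identical.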

In [3]:
# Write your code from here
import pandas as pd

# Load the dataset (replace 'your_data.csv' with your actual file path)
csv_file = 'your_data.csv'
df = pd.read_csv(csv_file)

print("Data preview:")
print(df.head())

# 1. Completeness: % non-null values per column
completeness = df.notnull().mean() * 100  # in percentage

# 2. Validity: % of non-null Email entries containing '@'
# (cast to str first so a non-string value cannot raise a TypeError)
valid_email_mask = df['Email'].dropna().astype(str).str.contains('@', regex=False)
validity = valid_email_mask.mean() * 100  # percentage of valid emails

# 3. Uniqueness: count of distinct emails (ignoring nulls)
unique_emails = df['Email'].nunique(dropna=True)

# Output metrics
print("\nData Quality Metrics:")
print(f"Completeness (% non-null):\n{completeness}")
print(f"\nEmail Validity (% emails containing '@'): {validity:.2f}%")
print(f"Email Uniqueness (distinct count): {unique_emails}")


FileNotFoundError: [Errno 2] No such file or directory: 'your_data.csv'

### Task 2: Calculating Data Quality Score
**Description**: Aggregate multiple metrics to calculate an overall data quality score.

**Steps**:
1. Formula: Simple average of all metrics defined in Task 1.
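
Written out, the simple average looks as follows. The symbols $C$, $V$, and $U$ are introduced here for illustration and stand for completeness, validity, and uniqueness, each as a percentage. Since Task 1 defines uniqueness as a count, this assumes it is first rescaled to a percentage of the non-null emails so that all three terms share a 0 to 100 scale.

$$
\text{score} = \frac{C + V + U}{3}, \qquad U = 100 \cdot \frac{\text{distinct emails}}{\text{non-null emails}}
$$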

In [None]:
# Write your code from here
import pandas as pd

# Load dataset
csv_file = 'your_data.csv'
df = pd.read_csv(csv_file)

# 1. Completeness: average % of non-null values across all columns
completeness_per_column = df.notnull().mean() * 100
completeness = completeness_per_column.mean()  # average completeness %

# 2. Validity: % of non-null email entries containing '@'
# (cast to str first so a non-string value cannot raise a TypeError)
valid_email_mask = df['Email'].dropna().astype(str).str.contains('@', regex=False)
validity = valid_email_mask.mean() * 100

# 3. Uniqueness: number of unique emails normalized by total non-null emails
unique_emails = df['Email'].nunique(dropna=True)
total_emails = df['Email'].notnull().sum()
uniqueness = (unique_emails / total_emails) * 100 if total_emails > 0 else 0

# 4. Calculate overall Data Quality Score (simple average)
data_quality_score = (completeness + validity + uniqueness) / 3

# Display metrics and overall score
print("Data Quality Metrics:")
print(f"Completeness: {completeness:.2f}%")
print(f"Validity: {validity:.2f}%")
print(f"Uniqueness: {uniqueness:.2f}%")
print(f"\nOverall Data Quality Score: {data_quality_score:.2f}%")


### Task 3: Creating Expectations for a CSV
**Description**: Develop basic data quality expectations using Great Expectations.

**Steps**:
1. Expectation Suite
2. Define Expectations for Completeness

In [None]:
# Write your code from here
import great_expectations as ge

# Load your CSV as a GE dataset.
# Note: ge.read_csv belongs to the legacy Pandas dataset API of older
# Great Expectations releases; it is not available in 1.0 and later.
df_ge = ge.read_csv("your_data.csv")  # replace with your CSV path

# Name the suite and choose the JSON file it will be saved to
expectation_suite_name = "basic_completeness_suite"
suite_path = f"{expectation_suite_name}.json"

# Add completeness expectations (no missing values) for selected columns
columns = ['Name', 'Email', 'Age']

for col in columns:
    df_ge.expect_column_values_to_not_be_null(col)

# Save the expectation suite to a JSON file; keep expectations that
# currently fail so they are still checked on later runs
df_ge.save_expectation_suite(filepath=suite_path, discard_failed_expectations=False)

# Validate dataset against the saved suite
results = df_ge.validate(expectation_suite=suite_path)

# Print validation results summary
print(f"Validation Results for suite '{expectation_suite_name}':")
print(results)



### Task 4: Running and Validating Expectations
**Description**: Run the created expectations and generate an output report.

**Steps**:
1. Validate
2. Generate HTML Report

In [None]:
# Write your code from here

import great_expectations as ge
from great_expectations.checkpoint import SimpleCheckpoint

# Load dataset
df_ge = ge.read_csv("your_data.csv")  # replace with your CSV path

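# Name of the expectation suite created earlier and the JSON file it was saved to
expectation_suite_name = "basic_completeness_suite"
suite_path = f"{expectation_suite_name}.json"  # assumed path; adjust to where the suite was saved

# Validate the dataset against the saved suite. The legacy validate() expects a
# suite object or a path to a suite JSON file, not a suite name.
results = df_ge.validate(expectation_suite=suite_path, result_format="SUMMARY")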

print("Validation summary:")
print(results)

# --- Generate HTML report ---

# Initialize the Great Expectations context. This assumes a project has already
# been initialized, that it defines a datasource named "default_runtime_datasource"
# with a "runtime_data_connector", and that the suite is saved in its expectation store.
context = ge.data_context.DataContext()

# Create a checkpoint to run validation and generate report
checkpoint_name = "my_checkpoint"

# Define a simple checkpoint config. In-memory data is not serializable, so the
# stored config only names the suite; the batch itself is passed at run time.
checkpoint_config = {
    "name": checkpoint_name,
    "config_version": 1,
    "class_name": "SimpleCheckpoint",
    "expectation_suite_name": expectation_suite_name,
}

# Add the checkpoint to the context
context.add_checkpoint(**checkpoint_config)

# Describe the in-memory batch to validate
batch_request = {
    "datasource_name": "default_runtime_datasource",
    "data_connector_name": "runtime_data_connector",
    "data_asset_name": "your_data_asset",
    "runtime_parameters": {"batch_data": df_ge},
    "batch_identifiers": {"default_identifier_name": "default_identifier"},
}

# Run the checkpoint validation
checkpoint_result = context.run_checkpoint(
    checkpoint_name=checkpoint_name,
    validations=[{"batch_request": batch_request}],
)

# List the suites that were validated in this run
for identifier in checkpoint_result.list_validation_result_identifiers():
    print("Validated suite:", identifier.expectation_suite_identifier.expectation_suite_name)

print("Validation result stored. You can explore the results in your Great Expectations 'uncommitted/validations' directory.")

# Alternatively, generate standalone HTML from the validation results above:
from great_expectations.render.renderer import ValidationResultsPageRenderer
from great_expectations.render.view import DefaultJinjaPageView

# The renderer takes a single validation result and returns a document model;
# the view then turns that model into an HTML string
document_model = ValidationResultsPageRenderer().render(results)
html_report = DefaultJinjaPageView().render(document_model)

# Save the HTML report
with open("validation_report.html", "w", encoding="utf-8") as f:
    f.write(html_report)

print("HTML validation report generated: validation_report.html")


### Task 5: Automating Data Quality Score Calculation
**Description**: Automate the data quality score via a script that integrates with Great
Expectations.

In [None]:
# Write your code from here
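
One possible starting point is a small function that turns a Great Expectations validation result into a single score. The sketch below assumes the result has been converted to a plain dictionary (for example with the legacy `to_json_dict()` method) and that each entry under `results` carries a `success` flag and, for column-value expectations, a `result.unexpected_percent` field. The function name `quality_score_from_validation` and the sample dictionary are hypothetical; the sample is hand-written for illustration and is not real Great Expectations output.

```python
def quality_score_from_validation(validation: dict) -> float:
    """Average per-expectation pass rate (0-100) from a validation result dict."""
    rates = []
    for item in validation.get("results", []):
        unexpected = (item.get("result") or {}).get("unexpected_percent")
        if unexpected is not None:
            # Share of rows that met the expectation
            rates.append(100.0 - float(unexpected))
        else:
            # No row-level detail: fall back to pass/fail
            rates.append(100.0 if item.get("success") else 0.0)
    return sum(rates) / len(rates) if rates else 0.0


# Hand-written sample in the assumed shape (not real output)
sample_validation = {
    "success": False,
    "results": [
        {"success": True, "result": {"unexpected_percent": 0.0}},
        {"success": False, "result": {"unexpected_percent": 25.0}},
        {"success": False, "result": {}},
    ],
}

score = quality_score_from_validation(sample_validation)
print(f"Overall Data Quality Score: {score:.2f}%")
```

For `expect_column_values_to_not_be_null`, `100 - unexpected_percent` is the completeness of that column, so with the suite from Task 3 this score matches the average completeness from Task 2. In a scheduled script, the dictionary would come from the validation step instead of the sample.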


### Task 6: Leveraging Data Quality Metrics for Automated Data Cleaning
**Description**: Implement a system where if data quality metrics fall below a threshold,
automated data cleaning scripts are triggered.

**Steps**:
1. Define Cleaning Logic
2. Integrate with Great Expectations:
    - Use an action within the Great Expectations action list that triggers only when the quality score is below a threshold, automating the cleaning.

In [None]:
# Write your code from here
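
As a rough sketch of the threshold logic, the code below computes the Task 2 score and runs a cleaning step only when the score falls below a threshold. It keeps the trigger in plain Python rather than in a Great Expectations checkpoint action, because the interface for custom actions differs between library versions. The threshold value, the function names, the sample rows, and the specific cleaning rules (drop rows with missing or `@`-less emails, then drop duplicate emails) are all assumptions chosen for illustration.

```python
import pandas as pd

QUALITY_THRESHOLD = 90.0  # hypothetical threshold, in percent


def quality_score(df: pd.DataFrame) -> float:
    """Simple average of completeness, email validity and email uniqueness."""
    completeness = df.notnull().mean().mean() * 100
    emails = df["Email"].dropna().astype(str)
    if len(emails) == 0:
        validity = uniqueness = 0.0
    else:
        validity = emails.str.contains("@", regex=False).mean() * 100
        uniqueness = emails.nunique() / len(emails) * 100
    return (completeness + validity + uniqueness) / 3


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows with missing or '@'-less emails, then duplicate emails."""
    cleaned = df.dropna(subset=["Email"])
    cleaned = cleaned[cleaned["Email"].astype(str).str.contains("@", regex=False)]
    cleaned = cleaned.drop_duplicates(subset="Email", keep="first")
    return cleaned.reset_index(drop=True)


def clean_if_needed(df: pd.DataFrame, threshold: float = QUALITY_THRESHOLD):
    """Return (dataframe, triggered); cleaning runs only below the threshold."""
    if quality_score(df) < threshold:
        return clean(df), True
    return df, False


# Hypothetical sample data standing in for the CSV
raw = pd.DataFrame({
    "Name": ["Ana", "Ben", None, "Dee"],
    "Email": ["ana@example.com", "ben.example.com", "ana@example.com", None],
    "Age": [34, 28, 45, 51],
})

cleaned, triggered = clean_if_needed(raw)
print(f"Score before: {quality_score(raw):.2f}%, cleaning triggered: {triggered}")
print(f"Rows after: {len(cleaned)}, score after: {quality_score(cleaned):.2f}%")
```

Dropping rows is the bluntest possible cleaning rule and discards data; a real pipeline might instead impute values or route bad rows to a review queue.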
