### Task 1: Understanding and Defining Data Quality Metrics
**Description**: Learn how to define basic data quality metrics such as completeness, validity, and uniqueness for a simple dataset.

**Steps**:
1. Dataset: Use a CSV with the columns `Name`, `Email`, and `Age`.
2. Metric Definitions:
    - Completeness: Percentage of non-null values in each column.
    - Validity: Percentage of `Email` values that contain `@`.
    - Uniqueness: Number of distinct entries in the `Email` column.

In [1]:
# Write your code from here
import pandas as pd
from io import StringIO

# Sample CSV data as a multi-line string
csv_data = """Name,Email,Age
Alice,alice@example.com,30
Bob,bob.example.com,25
Charlie,charlie@example.com,
David,david@example,22
,Eve@example.com,28
Frank,frank@example.com,33
"""

# Use StringIO to simulate reading from a file
df = pd.read_csv(StringIO(csv_data))

# 1. Completeness: Percentage of non-null values per column
completeness = df.notnull().mean() * 100

# 2. Validity: % of email fields containing '@'
valid_emails = df['Email'].str.contains('@', na=False).mean() * 100

# 3. Uniqueness: Count distinct entries in the Email column
unique_emails = df['Email'].nunique()

# Print results
print("Data Quality Metrics:")
print("---------------------")
print(f"Completeness per column (%):\n{completeness}\n")
print(f"Email Validity (% with '@'): {valid_emails:.2f}%")
print(f"Unique Emails Count: {unique_emails}")


Data Quality Metrics:
---------------------
Completeness per column (%):
Name      83.333333
Email    100.000000
Age       83.333333
dtype: float64

Email Validity (% with '@'): 83.33%
Unique Emails Count: 6
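
The `@` test is intentionally loose: `david@example` counts as valid even though its domain has no dot. As a rough sketch of a stricter rule, validity could be measured with a simple pattern. The names `EMAIL_PATTERN` and `email_validity_pct` are illustrative assumptions, and the pattern is not a full RFC 5322 validator.

```python
import re

# Hypothetical, deliberately simple pattern: something@something.something,
# with no whitespace and exactly one '@'. Not a full RFC 5322 validator.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def email_validity_pct(emails):
    """Return the percentage of entries that match EMAIL_PATTERN.

    Missing values (None) count as invalid, mirroring na=False above.
    """
    if not emails:
        return 0.0
    valid = sum(1 for e in emails if e is not None and EMAIL_PATTERN.match(e))
    return valid / len(emails) * 100


# The Email column of the sample dataset
sample_emails = [
    "alice@example.com",
    "bob.example.com",
    "charlie@example.com",
    "david@example",
    "Eve@example.com",
    "frank@example.com",
]
print(f"Stricter email validity: {email_validity_pct(sample_emails):.2f}%")
```

On the sample data this lowers validity from 83.33% to 66.67%, because `david@example` no longer passes.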


### Task 2: Calculating Data Quality Score
**Description**: Aggregate multiple metrics to calculate an overall data quality score.

**Steps**:
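1. Formula: Unweighted mean of three percentages: average completeness across columns, email validity, and email uniqueness (distinct emails as a share of non-null emails, since Task 1 reports uniqueness as a count).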
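
As a rough illustration of the formula, the sketch below averages a dictionary of metric percentages. The helper name `quality_score` is an assumption made for this example, and the inputs are the fractions implied by the six-row sample dataset from Task 1, with uniqueness expressed as a percentage of non-null emails.

```python
from statistics import mean


def quality_score(metrics):
    """Unweighted mean of metric percentages (each expected in 0-100)."""
    if not metrics:
        raise ValueError("at least one metric is required")
    return mean(metrics.values())


# Percentages implied by the six-row sample dataset
metrics = {
    "completeness": mean([5 / 6, 6 / 6, 5 / 6]) * 100,  # Name, Email, Age
    "validity": 5 / 6 * 100,    # emails containing '@'
    "uniqueness": 6 / 6 * 100,  # distinct emails / non-null emails
}
print(f"Overall Data Quality Score: {quality_score(metrics):.2f}%")
```

Because every metric carries the same weight, one weak metric can be masked by two strong ones; a weighted mean is a common alternative when some metrics matter more.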

In [2]:
# Write your code from here
import pandas as pd
from io import StringIO

# Sample CSV data
csv_data = """Name,Email,Age
Alice,alice@example.com,30
Bob,bob.example.com,25
Charlie,charlie@example.com,
David,david@example,22
,Eve@example.com,28
Frank,frank@example.com,33
"""

# Load data into DataFrame
df = pd.read_csv(StringIO(csv_data))

# 1. Completeness: Percentage of non-null values per column
completeness = df.notnull().mean() * 100

# 2. Validity: % of email fields containing '@'
valid_emails = df['Email'].str.contains('@', na=False).mean() * 100

# 3. Uniqueness: Percentage of unique emails relative to total emails
unique_emails_count = df['Email'].nunique()
total_emails_count = df['Email'].notnull().sum()
uniqueness = (unique_emails_count / total_emails_count) * 100 if total_emails_count > 0 else 0

# Calculate Data Quality Score: average of completeness, validity, and uniqueness
# We take average completeness across columns
avg_completeness = completeness.mean()
data_quality_score = (avg_completeness + valid_emails + uniqueness) / 3

# Print results
print("Data Quality Metrics:")
print("---------------------")
print(f"Completeness per column (%):\n{completeness}\n")
print(f"Average Completeness (%): {avg_completeness:.2f}%")
print(f"Email Validity (% with '@'): {valid_emails:.2f}%")
print(f"Email Uniqueness (% unique): {uniqueness:.2f}%")
print("---------------------")
print(f"Overall Data Quality Score: {data_quality_score:.2f}%")


Data Quality Metrics:
---------------------
Completeness per column (%):
Name      83.333333
Email    100.000000
Age       83.333333
dtype: float64

Average Completeness (%): 88.89%
Email Validity (% with '@'): 83.33%
Email Uniqueness (% unique): 100.00%
---------------------
Overall Data Quality Score: 90.74%


### Task 3: Creating Expectations for a CSV
**Description**: Develop basic data quality expectations using Great Expectations.

**Steps**:
1. Create an Expectation Suite.
2. Define completeness expectations: no column may contain null values.

In [3]:
# Write your code from here
import pandas as pd
import great_expectations as gx

# Uses the GX Core 1.x API; the legacy ge.read_csv / ge.from_pandas helpers
# are not available in that release line.

# Step 1: Create sample CSV file
csv_content = """Name,Email,Age
Alice,alice@example.com,30
Bob,bob.example.com,25
Charlie,charlie@example.com,
David,david@example,22
,Eve@example.com,28
Frank,frank@example.com,33
"""

with open("sample_data.csv", "w") as f:
    f.write(csv_content)

# Step 2: Load the CSV with pandas and register it as a GX batch
df = pd.read_csv("sample_data.csv")
context = gx.get_context(mode="ephemeral")
data_source = context.data_sources.add_pandas("pandas_source")
data_asset = data_source.add_dataframe_asset(name="sample_data")
batch_definition = data_asset.add_batch_definition_whole_dataframe("whole_dataframe")
batch = batch_definition.get_batch(batch_parameters={"dataframe": df})

# Step 3: Create an Expectation Suite (stored in the context) and add
# completeness expectations
expectation_suite_name = "completeness_suite"
suite = context.suites.add(gx.ExpectationSuite(name=expectation_suite_name))

# Add expectations: values should not be null for all columns
for col in df.columns:
    suite.add_expectation(gx.expectations.ExpectColumnValuesToNotBeNull(column=col))

# Step 4: Validate the dataset against the suite (SUMMARY keeps the output compact)
results = batch.validate(suite, result_format="SUMMARY")

# Step 5: Print validation results
print("Validation Results Summary:")
print(results)

### Task 4: Running and Validating Expectations
**Description**: Run the created expectations and generate an output report.

**Steps**:
1. Validate the dataset against the Expectation Suite.
2. Generate an HTML report (Data Docs) from the validation results.

In [None]:
# Write your code from here
import pandas as pd
import great_expectations as gx
from great_expectations.checkpoint import UpdateDataDocsAction

# Uses the GX Core 1.x API; the legacy ge.read_csv helper, batch_kwargs, and
# validation operators are not available in that release line.

# Step 1: Create sample CSV file (if not already created)
csv_content = """Name,Email,Age
Alice,alice@example.com,30
Bob,bob.example.com,25
Charlie,charlie@example.com,
David,david@example,22
,Eve@example.com,28
Frank,frank@example.com,33
"""

with open("sample_data.csv", "w") as f:
    f.write(csv_content)

# Step 2: Load the CSV with pandas and register it with an in-memory Data Context
df = pd.read_csv("sample_data.csv")
context = gx.get_context(mode="ephemeral")
data_source = context.data_sources.add_pandas("pandas_source")
data_asset = data_source.add_dataframe_asset(name="sample_data")
batch_definition = data_asset.add_batch_definition_whole_dataframe("whole_dataframe")

# Step 3: Create an Expectation Suite and add completeness expectations
expectation_suite_name = "completeness_suite"
suite = context.suites.add(gx.ExpectationSuite(name=expectation_suite_name))

for col in df.columns:
    suite.add_expectation(gx.expectations.ExpectColumnValuesToNotBeNull(column=col))

# Step 4: Run validation through a Checkpoint so the result is stored for Data Docs
validation_definition = context.validation_definitions.add(
    gx.ValidationDefinition(
        name="completeness_validation", data=batch_definition, suite=suite
    )
)
checkpoint = gx.Checkpoint(
    name="completeness_checkpoint",
    validation_definitions=[validation_definition],
    actions=[UpdateDataDocsAction(name="update_data_docs")],
    result_format={"result_format": "SUMMARY"},
)
context.checkpoints.add(checkpoint)
results = checkpoint.run(batch_parameters={"dataframe": df})

print("Validation Results Summary:")
print(results)

# Step 5: Generate HTML report
# Build the Data Docs site; the ephemeral context is expected to write its
# default site to a temporary directory
context.build_data_docs()

# Get the URL of the generated HTML validation report
site_urls = context.get_docs_sites_urls()
print("\nValidation HTML report available at:")
print(site_urls)

### Task 5: Automating Data Quality Score Calculation
**Description**: Automate the data quality score calculation with a script that integrates with Great Expectations.

In [None]:
# Write your code from here
import pandas as pd
import great_expectations as gx

# Uses the GX Core 1.x API; the legacy ge.read_csv / ge.from_pandas helpers
# are not available in that release line.

# Sample CSV content
csv_content = """Name,Email,Age
Alice,alice@example.com,30
Bob,bob.example.com,25
Charlie,charlie@example.com,
David,david@example,22
,Eve@example.com,28
Frank,frank@example.com,33
"""

# Write CSV to file
with open("sample_data.csv", "w") as f:
    f.write(csv_content)

# Load CSV with pandas and register it as a GX batch
df = pd.read_csv("sample_data.csv")
context = gx.get_context(mode="ephemeral")
data_source = context.data_sources.add_pandas("pandas_source")
data_asset = data_source.add_dataframe_asset(name="sample_data")
batch_definition = data_asset.add_batch_definition_whole_dataframe("whole_dataframe")
batch = batch_definition.get_batch(batch_parameters={"dataframe": df})

# Create Expectation Suite for completeness
expectation_suite_name = "completeness_suite"
suite = context.suites.add(gx.ExpectationSuite(name=expectation_suite_name))

for col in df.columns:
    suite.add_expectation(gx.expectations.ExpectColumnValuesToNotBeNull(column=col))

# Validate completeness expectations
validation_results = batch.validate(suite, result_format="SUMMARY")

# Extract completeness results from validation.
# Every expectation in this suite is a not-null check, so the nulls are the
# "unexpected" values: non-null count = element_count - unexpected_count.
completeness_scores = []
for result in validation_results.results:
    completeness_scores.append(
        result.result["element_count"] - result.result["unexpected_count"]
    )

total_rows = len(df)

avg_completeness_pct = (sum(completeness_scores) / (total_rows * len(df.columns))) * 100 if total_rows > 0 else 0

# Additional metrics calculated manually on the pandas dataframe

# Validity: % of email fields containing '@'
valid_emails_pct = df['Email'].str.contains('@', na=False).mean() * 100

# Uniqueness: % of unique emails relative to total non-null emails
unique_email_count = df['Email'].nunique()
total_non_null_emails = df['Email'].notnull().sum()
uniqueness_pct = (unique_email_count / total_non_null_emails) * 100 if total_non_null_emails > 0 else 0

# Calculate overall Data Quality Score (average of the three metrics)
data_quality_score = (avg_completeness_pct + valid_emails_pct + uniqueness_pct) / 3

# Output the scores
print(f"Completeness (Average %): {avg_completeness_pct:.2f}%")
print(f"Email Validity (% containing '@'): {valid_emails_pct:.2f}%")
print(f"Email Uniqueness (% unique emails): {uniqueness_pct:.2f}%")
print(f"\nOverall Data Quality Score: {data_quality_score:.2f}%")

### Task 6: Leveraging Data Quality Metrics for Automated Data Cleaning
**Description**: Implement a system in which automated data cleaning scripts are triggered when data quality metrics fall below a threshold.

**Steps**:
1. Define Cleaning Logic
2. Integrate with Great Expectations:
    - Use an action within the Great Expectations action list that only triggers if quality score is below a threshold, automating the cleaning.
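
The threshold logic itself does not depend on Great Expectations and can be sketched as a small gate function. The names `clean_if_below` and `drop_invalid_emails` are hypothetical, and this plain-Python gate stands in for a Great Expectations action rather than implementing one.

```python
from typing import Callable, List, Tuple

Row = dict


def clean_if_below(
    rows: List[Row],
    score: float,
    threshold: float,
    cleaner: Callable[[List[Row]], List[Row]],
) -> Tuple[List[Row], bool]:
    """Run `cleaner` only when `score` is below `threshold`.

    Returns the (possibly cleaned) rows and a flag saying whether cleaning ran.
    """
    if score < threshold:
        return cleaner(rows), True
    return rows, False


def drop_invalid_emails(rows: List[Row]) -> List[Row]:
    """Example cleaner: keep only rows whose Email contains '@'."""
    return [r for r in rows if "@" in (r.get("Email") or "")]


rows = [
    {"Name": "Alice", "Email": "alice@example.com"},
    {"Name": "Bob", "Email": "bob.example.com"},
]

# A score of 90.74 passes an 80% threshold, so nothing is cleaned
_, ran = clean_if_below(rows, score=90.74, threshold=80, cleaner=drop_invalid_emails)
print(ran)  # False

# Raising the threshold to 95 triggers the cleaner
cleaned, ran = clean_if_below(rows, score=90.74, threshold=95, cleaner=drop_invalid_emails)
print(ran, len(cleaned))  # True 1
```

The sample dataset scores 90.74% in Task 2, so a threshold of 80 never triggers cleaning; a stricter threshold or dirtier data is needed to exercise the cleaning path.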

In [None]:
# Write your code from here
import pandas as pd
import great_expectations as gx

# Uses the GX Core 1.x API; the legacy ge.read_csv / ge.from_pandas helpers
# are not available in that release line.

# Step 1: Create sample CSV file
csv_content = """Name,Email,Age
Alice,alice@example.com,30
Bob,bob.example.com,25
Charlie,charlie@example.com,
David,david@example,22
,Eve@example.com,28
Frank,frank@example.com,33
"""

with open("sample_data.csv", "w") as f:
    f.write(csv_content)

# Step 2: Define automated cleaning logic
def clean_data(df: pd.DataFrame) -> pd.DataFrame:
    # Work on a copy so the caller's DataFrame is left untouched and pandas
    # does not warn about assigning to a slice
    df = df.copy()
    # Fill missing Age with median
    df['Age'] = df['Age'].fillna(df['Age'].median())
    # Fill missing Names with 'Unknown'
    df['Name'] = df['Name'].fillna('Unknown')
    # Keep only emails containing '@'
    return df[df['Email'].str.contains('@', na=False)]

# Step 3: Load data and register it as a GX batch
df = pd.read_csv("sample_data.csv")
context = gx.get_context(mode="ephemeral")
data_source = context.data_sources.add_pandas("pandas_source")
data_asset = data_source.add_dataframe_asset(name="sample_data")
batch_definition = data_asset.add_batch_definition_whole_dataframe("whole_dataframe")
batch = batch_definition.get_batch(batch_parameters={"dataframe": df})

# Step 4: Set completeness expectations for all columns
suite = context.suites.add(gx.ExpectationSuite(name="completeness_suite"))
for col in df.columns:
    suite.add_expectation(gx.expectations.ExpectColumnValuesToNotBeNull(column=col))

# Step 5: Validate data completeness
results = batch.validate(suite, result_format="SUMMARY")

# Step 6: Calculate completeness % from validation results
# Every expectation in this suite is a not-null check, so each null value is
# reported as an "unexpected" value
total_rows = len(df)
completeness_counts = []
for res in results.results:
    missing = res.result["unexpected_count"]
    completeness_counts.append(total_rows - missing)

avg_completeness_pct = (sum(completeness_counts) / (total_rows * len(df.columns))) * 100

# Step 7: Calculate email validity (% emails containing '@')
validity_pct = df['Email'].str.contains('@', na=False).mean() * 100

# Step 8: Calculate uniqueness (% unique emails)
unique_email_count = df['Email'].nunique()
non_null_email_count = df['Email'].notnull().sum()
uniqueness_pct = (unique_email_count / non_null_email_count) * 100 if non_null_email_count > 0 else 0

# Step 9: Calculate overall Data Quality Score
data_quality_score = (avg_completeness_pct + validity_pct + uniqueness_pct) / 3

print(f"Completeness: {avg_completeness_pct:.2f}%")
print(f"Email Validity: {validity_pct:.2f}%")
print(f"Email Uniqueness: {uniqueness_pct:.2f}%")
print(f"Overall Data Quality Score: {data_quality_score:.2f}%")

# Step 10: Automate cleaning if score below threshold
threshold = 80
if data_quality_score < threshold:
    print(f"Data Quality Score below {threshold}%. Running cleaning script...")
    df_cleaned = clean_data(df)
    df_cleaned.to_csv("cleaned_sample_data.csv", index=False)
    print("Cleaned data saved to 'cleaned_sample_data.csv'.")
else:
    print(f"Data Quality Score meets the {threshold}% threshold. No cleaning needed.")