### Task 1: Understanding and Defining Data Quality Metrics
**Description**: Learn how to define basic data quality metrics such as completeness, validity, and uniqueness for a simple dataset.

**Steps**:
1. Dataset: Use a CSV with columns such as `Name`, `Email`, and `Age`.
2. Metric Definitions:
    - Completeness: Percentage of non-null values.
    - Validity: Percentage of `Email` values containing `@`.
    - Uniqueness: Count of distinct entries in the `Email` column.

In [1]:
# Write your code from here
import pandas as pd

def calculate_completeness(series):
    return series.count() / len(series) if len(series) > 0 else 0.0

def calculate_validity_email(series):
    valid_count = series.astype(str).str.contains('@').sum()
    return valid_count / len(series) if len(series) > 0 else 0.0

def calculate_uniqueness(series):
    return series.nunique()

data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', None],
        'Email': ['alice@example.com', 'bob@example.com', 'charlie@example.com', 'david.example', 'eve@example.com', ''],
        'Age': [25, 30, 22, None, 28, 31]}
df = pd.DataFrame(data)

completeness_name = calculate_completeness(df['Name'])
completeness_email = calculate_completeness(df['Email'])
completeness_age = calculate_completeness(df['Age'])

validity_email = calculate_validity_email(df['Email'])

uniqueness_email = calculate_uniqueness(df['Email'])

print(f"Completeness - Name: {completeness_name:.2f}")
print(f"Completeness - Email: {completeness_email:.2f}")
print(f"Completeness - Age: {completeness_age:.2f}")
print(f"Validity - Email: {validity_email:.2f}")
print(f"Uniqueness - Email: {uniqueness_email}")


Completeness - Name: 0.83
Completeness - Email: 1.00
Completeness - Age: 0.83
Validity - Email: 0.67
Uniqueness - Email: 6
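
The output above reports an Email completeness of 1.00 even though the last row holds an empty string, because `Series.count()` excludes only null values. The following is a minimal sketch, assuming that blank or whitespace-only strings should also count as missing; the helper name `calculate_completeness_strict` is hypothetical and not part of the exercise.

```python
import pandas as pd

def calculate_completeness_strict(series: pd.Series) -> float:
    """Share of values that are neither null nor blank strings."""
    if len(series) == 0:
        return 0.0
    # Flag empty or whitespace-only strings; non-string values are never blank.
    is_blank = series.map(lambda v: isinstance(v, str) and v.strip() == "")
    present = series.notna() & ~is_blank
    return float(present.sum() / len(series))

emails = pd.Series(['alice@example.com', 'bob@example.com', 'charlie@example.com',
                    'david.example', 'eve@example.com', ''])
strict_email_completeness = calculate_completeness_strict(emails)
print(f"Strict completeness - Email: {strict_email_completeness:.2f}")
```

Under this stricter definition the empty string no longer counts, so five of the six email values are treated as present.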


### Task 2: Calculating Data Quality Score
**Description**: Aggregate multiple metrics to calculate an overall data quality score.

**Steps**:
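1. Formula: Simple average of all metrics defined in Task 1, with uniqueness expressed as a ratio (distinct values divided by total rows) so that every metric lies between 0 and 1.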

In [2]:
# Write your code from here
import pandas as pd

def calculate_completeness(series):
    return series.count() / len(series) if len(series) > 0 else 0.0

def calculate_validity_email(series):
    valid_count = series.astype(str).str.contains('@').sum()
    return valid_count / len(series) if len(series) > 0 else 0.0

def calculate_uniqueness(series):
    return series.nunique() / len(series) if len(series) > 0 else 0.0

data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', None],
        'Email': ['alice@example.com', 'bob@example.com', 'charlie@example.com', 'david.example', 'eve@example.com', ''],
        'Age': [25, 30, 22, None, 28, 31]}
df = pd.DataFrame(data)

completeness_name = calculate_completeness(df['Name'])
completeness_email = calculate_completeness(df['Email'])
completeness_age = calculate_completeness(df['Age'])

validity_email = calculate_validity_email(df['Email'])

uniqueness_email = calculate_uniqueness(df['Email'])

data_quality_score = (completeness_name + completeness_email + completeness_age + validity_email + uniqueness_email) / 5

print(f"Completeness - Name: {completeness_name:.2f}")
print(f"Completeness - Email: {completeness_email:.2f}")
print(f"Completeness - Age: {completeness_age:.2f}")
print(f"Validity - Email: {validity_email:.2f}")
print(f"Uniqueness - Email: {uniqueness_email:.2f}")
print(f"Overall Data Quality Score: {data_quality_score:.2f}")

Completeness - Name: 0.83
Completeness - Email: 1.00
Completeness - Age: 0.83
Validity - Email: 0.67
Uniqueness - Email: 1.00
Overall Data Quality Score: 0.87
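
As a check on the reported score, the simple average can be worked by hand from the unrounded ratios behind the printed values: 5/6 for each of the Name and Age completeness figures, 1 for Email completeness, 4/6 for validity, and 6/6 for uniqueness.

```latex
\text{score}
  = \frac{\tfrac{5}{6} + 1 + \tfrac{5}{6} + \tfrac{4}{6} + \tfrac{6}{6}}{5}
  = \frac{26/6}{5}
  = \frac{26}{30}
  \approx 0.87
```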


### Task 3: Creating Expectations for a CSV
**Description**: Develop basic data quality expectations using Great Expectations.

**Steps**:
1. Expectation Suite
2. Define Expectations for Completeness
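
Before wiring up Great Expectations, it can help to see what a completeness expectation actually checks. The sketch below is a plain-pandas stand-in, not the Great Expectations API: the function `expect_not_null` and the shape of its result dictionary are hypothetical, loosely modelled on the idea of a `mostly` tolerance.

```python
import pandas as pd

def expect_not_null(df: pd.DataFrame, column: str, mostly: float = 1.0) -> dict:
    """Pass when at least `mostly` of the values in `column` are non-null."""
    total = len(df)
    observed = float(df[column].notna().sum() / total) if total else 0.0
    return {
        "column": column,
        "mostly": mostly,
        "observed": observed,
        "success": observed >= mostly,
    }

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David", "Eve", None],
    "Age": [25, 30, 22, None, 28, 31],
})

# A small "suite": one (column, tolerance) pair per expectation.
suite = [("Name", 1.0), ("Age", 0.8)]
results = [expect_not_null(df, column, mostly) for column, mostly in suite]
for result in results:
    print(result)
```

With this sample data the `Name` check fails, because one of six values is null and no nulls are tolerated, while the `Age` check passes under its 0.8 tolerance.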

In [3]:
# Write your code from here


### Task 4: Running and Validating Expectations
**Description**: Run the created expectations and generate an output report.

**Steps**:
1. Validate
2. Generate HTML Report
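
The two steps above can be sketched without Great Expectations to show the overall flow: run each check, collect the results in a table, and render that table as HTML. Everything here is an assumption made for illustration, including the helper names `run_checks` and `write_html_report` and the report layout; Great Expectations produces its own, richer Data Docs.

```python
import os
import tempfile

import pandas as pd

def run_checks(df: pd.DataFrame, checks) -> pd.DataFrame:
    """Evaluate (column, mostly) completeness checks; return one row per check."""
    rows = []
    for column, mostly in checks:
        observed = float(df[column].notna().mean()) if len(df) else 0.0
        rows.append({
            "column": column,
            "mostly": mostly,
            "observed": round(observed, 2),
            "success": observed >= mostly,
        })
    return pd.DataFrame(rows)

def write_html_report(results: pd.DataFrame, path: str) -> str:
    """Write a one-page HTML summary of the check results and return its path."""
    passed = int(results["success"].sum())
    summary = f"<p>{passed} of {len(results)} checks passed.</p>"
    html = "<h1>Data Quality Report</h1>" + summary + results.to_html(index=False)
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(html)
    return path

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David", "Eve", None],
    "Age": [25, 30, 22, None, 28, 31],
})
results = run_checks(df, [("Name", 1.0), ("Age", 0.8)])
report_path = write_html_report(
    results, os.path.join(tempfile.mkdtemp(), "quality_report.html")
)
print(results)
print(f"Report written to {report_path}")
```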

In [4]:
# Write your code from here


### Task 5: Automating Data Quality Score Calculation
**Description**: Automate the data quality score calculation with a script that integrates with Great Expectations.
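
One way to approach this is to derive the score from the statistics that a validation run reports. The sketch below assumes a dictionary with the keys `successful_expectations` and `unsuccessful_expectations`, as read from `validation_result.statistics` elsewhere in this notebook. The function name `score_from_statistics` is hypothetical, and the sample numbers are made up for illustration rather than taken from a real run.

```python
def score_from_statistics(statistics: dict) -> float:
    """Convert validation statistics into a quality score between 0 and 1."""
    successful = statistics["successful_expectations"]
    failed = statistics["unsuccessful_expectations"]
    evaluated = successful + failed
    # No evaluated expectations means there is no evidence of quality.
    return successful / evaluated if evaluated else 0.0

# Illustrative values only; in a real script these would come from
# validation_result.statistics after running the suite.
sample_statistics = {
    "successful_expectations": 4,
    "unsuccessful_expectations": 2,
}
quality_score = score_from_statistics(sample_statistics)
print(f"Data quality score: {quality_score:.2f}")
```

Wrapping the conversion in a function keeps the script independent of how the validation is run, so the same score can be logged or compared against a threshold on every run.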

In [5]:
# Write your code from here


### Task 6: Leveraging Data Quality Metrics for Automated Data Cleaning
**Description**: Implement a system that triggers automated data cleaning scripts when data quality metrics fall below a threshold.

**Steps**:
1. Define Cleaning Logic
2. Integrate with Great Expectations:
    - Use an action in the Great Expectations action list that triggers only when the quality score falls below a threshold, so that cleaning runs automatically.
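
The threshold logic itself can be sketched in plain pandas before it is attached to a Great Expectations action. All names below (`QUALITY_THRESHOLD`, `completeness_score`, `clean`, `clean_if_needed`) are hypothetical, as are the threshold of 0.9 and the cleaning rules (drop rows without a name, fill missing ages with the median). This is a sketch of the trigger pattern, not of the Great Expectations integration.

```python
import pandas as pd

QUALITY_THRESHOLD = 0.9  # Hypothetical cut-off; tune it for your data.

def completeness_score(df: pd.DataFrame) -> float:
    """Fraction of non-null cells across the whole frame."""
    return float(df.notna().to_numpy().mean()) if df.size else 0.0

def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Example cleaning rules: drop rows without a name, impute missing ages."""
    cleaned = df.dropna(subset=["Name"]).copy()
    cleaned["Age"] = cleaned["Age"].fillna(cleaned["Age"].median())
    return cleaned

def clean_if_needed(df: pd.DataFrame, threshold: float = QUALITY_THRESHOLD):
    """Return (frame, triggered); cleaning runs only below the threshold."""
    if completeness_score(df) >= threshold:
        return df, False
    return clean(df), True

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David", "Eve", None],
    "Email": ["alice@example.com", "bob@example.com", "charlie@example.com",
              "david.example", "eve@example.com", ""],
    "Age": [25, 30, 22, None, 28, 31],
})

before = completeness_score(df)
cleaned_df, triggered = clean_if_needed(df)
after = completeness_score(cleaned_df)
print(f"Score before: {before:.2f}, cleaning triggered: {triggered}, score after: {after:.2f}")
```

Returning the `triggered` flag alongside the frame makes it easy to log when cleaning ran, which matters once the check is automated.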

In [6]:
# Write your code from here
# Install Great Expectations if needed
# !pip install great_expectations

import great_expectations as gx
import pandas as pd

# 1. Initialize a Data Context
context = gx.get_context()

# 2. Load your CSV file (example with sample data)
csv_path = "customer_data.csv"  # Replace with your actual file path

# Sample data creation (if you don't have a CSV yet)
sample_data = pd.DataFrame({
    'customer_id': [1001, 1002, 1003, None, 1005],
    'name': ['John', 'Jane', None, 'Mike', 'Sarah'],
    'email': ['john@email.com', 'jane@email.com', None, 'mike@email.com', None],
    'join_date': ['2023-01-01', '2023-01-02', None, '2023-01-04', '2023-01-05'],
    'purchase_amount': [125.50, 89.99, 150.00, None, 200.00]
})
sample_data.to_csv(csv_path, index=False)

# 3. Create a Validator directly from the CSV
# gx.from_pandas belongs to the legacy API and is missing from newer releases.
# Assumption: a 0.17/0.18-series release, where the default pandas data source
# returns a Validator; the 1.x API differs and needs a different setup.
validator = context.sources.pandas_default.read_csv(csv_path)

# 4. Name the Expectation Suite for Completeness
# save_expectation_suite() does not take a suite name, so set it on the validator.
expectation_suite_name = "customer_data_completeness"
validator.expectation_suite_name = expectation_suite_name

# 5. Define Completeness Expectations

# Basic column presence
validator.expect_table_columns_to_match_ordered_list(
    column_list=[
        'customer_id',
        'name',
        'email',
        'join_date',
        'purchase_amount'
    ]
)

# Column non-null expectations
validator.expect_column_values_to_not_be_null(column="customer_id")
validator.expect_column_values_to_not_be_null(column="name")
validator.expect_column_values_to_not_be_null(column="join_date")

# Email - allow some nulls but track completeness
validator.expect_column_values_to_not_be_null(
    column="email",
    mostly=0.8  # Allow up to 20% nulls
)

# Purchase amount - allow some nulls but track
validator.expect_column_values_to_not_be_null(
    column="purchase_amount",
    mostly=0.9  # Allow up to 10% nulls
)

# 6. Save the Expectation Suite
validator.save_expectation_suite(discard_failed_expectations=False)

# 7. Validate the data
validation_result = validator.validate()

# 8. View results
print("\nCompleteness Validation Results:")
print(f"Successful expectations: {validation_result.statistics['successful_expectations']}")
print(f"Failed expectations: {validation_result.statistics['unsuccessful_expectations']}")
print(f"Success rate: {validation_result.statistics['success_percent']:.2f}%")

# 9. Generate and open HTML report
context.build_data_docs()
context.open_data_docs()

# 10. Optional: Get detailed results for failed expectations
if not validation_result.success:
    print("\nFailed Expectations Details:")
    for result in validation_result.results:
        if not result.success:
            print(f"\nColumn: {result.expectation_config.kwargs.get('column')}")
            print(f"Expectation: {result.expectation_config.expectation_type}")
            print(f"Unexpected values (sample): {result.result.get('partial_unexpected_list')}")