### Task 1: Understanding and Defining Data Quality Metrics
**Description**: Learn how to define basic data quality metrics such as completeness, validity, and uniqueness for a simple dataset.

**Steps**:
1. Dataset: Use a CSV with columns like `Name`, `Email`, and `Age`.
2. Metric Definitions:
    - Completeness: Percentage of non-null values in each column.
    - Validity: Percentage of `Email` values that contain `@`.
    - Uniqueness: Number of distinct entries in the `Email` column.
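
The three definitions above can be sketched with pandas as follows. This is a minimal illustration, not a reference implementation: the sample rows are made up to stand in for the CSV, and treating a null email as invalid is an assumption.

```python
import pandas as pd

# Illustrative rows standing in for the CSV (assumed data);
# on a real file, load it with pd.read_csv(...) instead.
df = pd.DataFrame({
    "Name": ["Alice", "Bob", None, "Dana"],
    "Email": ["alice@example.com", "bob.example.com", "alice@example.com", None],
    "Age": [25, 30, 41, None],
})

# Completeness: percentage of non-null values in each column.
completeness = df.notna().mean() * 100

# Validity: percentage of rows whose Email contains "@".
# Assumption: a missing email counts as invalid (na=False).
validity = df["Email"].str.contains("@", regex=False, na=False).mean() * 100

# Uniqueness: number of distinct non-null Email values.
unique_emails = df["Email"].nunique()

print(completeness)
print(f"Validity: {validity:.1f}%")
print(f"Distinct emails: {unique_emails}")
```

With this sample, each column is 75% complete, half of the emails contain `@`, and there are two distinct email values.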

### Task 2: Calculating Data Quality Score
**Description**: Aggregate multiple metrics to calculate an overall data quality score.

**Steps**:
1. Formula: Take the simple (unweighted) average of the metrics defined in Task 1, with each metric expressed as a percentage so the values are comparable.
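
A simple average could look like the sketch below. The function name `quality_score` and the metric values are hypothetical. It also assumes that uniqueness has been converted from a raw count to a percentage (for example, distinct emails divided by non-null emails), since a count cannot be averaged meaningfully with percentages.

```python
def quality_score(metrics):
    """Return the unweighted mean of metric values, each on a 0-100 scale."""
    if not metrics:
        raise ValueError("at least one metric is required")
    return sum(metrics.values()) / len(metrics)


# Hypothetical metric values, all expressed as percentages.
metrics = {"completeness": 80.0, "validity": 60.0, "uniqueness": 100.0}
score = quality_score(metrics)
print(f"Data quality score: {score:.1f}")
```

An unweighted mean treats every metric as equally important. If one metric matters more for a given dataset, a weighted mean may be a better fit.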

### Task 3: Creating Expectations for a CSV
**Description**: Develop basic data quality expectations using Great Expectations.

**Steps**:
1. Create an Expectation Suite for the CSV.
2. Define expectations for completeness (for example, that required columns contain no null values).
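
To show what a completeness expectation checks, here is a plain-pandas sketch. It deliberately does not use the Great Expectations API, whose entry points differ between releases. The helper `expect_not_null`, the `suite` list, and the sample data are all hypothetical. Great Expectations itself ships a built-in `expect_column_values_to_not_be_null` expectation with a `mostly` argument, which this helper loosely imitates.

```python
import pandas as pd


def expect_not_null(df, column, mostly=1.0):
    """Hypothetical helper: succeed if at least `mostly` of `column` is non-null."""
    non_null_fraction = df[column].notna().mean()
    return {
        "column": column,
        "success": bool(non_null_fraction >= mostly),
        "non_null_percent": float(non_null_fraction * 100),
    }


# Assumed sample data standing in for the CSV.
df = pd.DataFrame({
    "Name": ["Alice", None, "Carol", "Dan"],
    "Email": ["a@x.com", "b@x.com", "c@x.com", "d@x.com"],
})

# A "suite" here is just a list of expectation settings (illustrative only).
suite = [
    {"column": "Name", "mostly": 0.9},
    {"column": "Email", "mostly": 1.0},
]
results = [expect_not_null(df, **spec) for spec in suite]
for result in results:
    print(result)
```

In this sample the `Name` expectation fails (75% non-null against a 90% requirement) and the `Email` expectation passes.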

### Task 4: Running and Validating Expectations
**Description**: Run the created expectations and generate an output report.

**Steps**:
1. Validate the dataset against the Expectation Suite.
2. Generate an HTML report of the validation results.
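
The validate-then-report flow can be sketched without Great Expectations by running a few checks and rendering the results with `DataFrame.to_html`. Great Expectations produces its own HTML output (Data Docs), so treat this only as an illustration of the idea. The check names, sample data, and output file name are assumptions.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Assumed sample data standing in for the CSV.
df = pd.DataFrame({
    "Name": ["Alice", None],
    "Email": ["a@x.com", "b@x.com"],
})

# Hypothetical checks: each maps a name to a function returning True or False.
checks = {
    "name_not_null": lambda d: d["Name"].notna().all(),
    "email_contains_at": lambda d: d["Email"].str.contains("@", regex=False, na=False).all(),
}

# Validate: run every check and collect one result row per check.
rows = [{"check": name, "success": bool(check(df))} for name, check in checks.items()]
report = pd.DataFrame(rows)

# Report: render the results as an HTML table and write it to a temporary file.
html = report.to_html(index=False)
report_path = Path(tempfile.gettempdir()) / "dq_report.html"
report_path.write_text(html, encoding="utf-8")
print(f"Report written to {report_path}")
```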

### Task 5: Automating Data Quality Score Calculation
**Description**: Automate the data quality score calculation with a script that integrates with Great Expectations.

```python
import pandas as pd
import re
import matplotlib.pyplot as plt

# Sample data (replace with your own dataset)
data = {
    'name': ['Alice', 'Bob', 'Charlie', None, 'Eve'],
    'age': [25, 30, 35, None, 22],
    'email': ['alice@example.com', 'bob@example', 'charlie@domain.com', 'eve@domain.com', None]
}

# Create DataFrame
df = pd.DataFrame(data)

# Helper function to check for an empty DataFrame
def check_empty(df):
    """Raise ValueError if the DataFrame is empty."""
    if df.empty:
        raise ValueError("The input DataFrame is empty!")

# Helper function to check that a column exists in the DataFrame
def check_column_exists(df, column):
    """Raise ValueError if the required column is missing."""
    if column not in df.columns:
        raise ValueError(f"Column '{column}' not found in the DataFrame!")

# Function to calculate completeness percentage for each column
def completeness(df):
    """Return the percentage of non-null values in each column."""
    try:
        check_empty(df)  # Ensure the DataFrame is not empty
        # notnull() counts present values; isnull() would measure missingness instead
        return df.notnull().mean() * 100
    except ValueError as ve:
        print(f"Error: {ve}")
        return None
    except Exception as e:
        print(f"Unexpected error: {e}")
        return None

# Function to validate email addresses in the DataFrame
def validity(df, column):
    """Return the percentage of values in the column that look like valid emails."""
    try:
        check_empty(df)  # Ensure the DataFrame is not empty
        check_column_exists(df, column)  # Check if the column exists
        # Regex pattern for email validation
        pattern = r'^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$'
        # na=False treats missing values as invalid
        valid_emails = df[column].str.match(pattern, na=False)
        return valid_emails.mean() * 100
    except ValueError as ve:
        print(f"Error: {ve}")
        return None
    except Exception as e:
        print(f"Unexpected error in validity function for {column}: {e}")
        return None
```

### Task 6: Leveraging Data Quality Metrics for Automated Data Cleaning
**Description**: Implement a system in which automated data cleaning scripts are triggered when data quality metrics fall below a threshold.

**Steps**:
1. Define the cleaning logic.
2. Integrate with Great Expectations:
    - Use an action in the Great Expectations action list that triggers the cleaning script only when the quality score falls below the threshold.
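
The threshold logic itself can be sketched in plain pandas, independent of how it is wired into a Great Expectations action list. The threshold value, the helper names, and the cleaning rules (drop rows with missing values, then drop duplicate rows) are all assumptions chosen for illustration.

```python
import pandas as pd

THRESHOLD = 90.0  # assumed minimum acceptable score, in percent


def completeness_score(df):
    """Average per-column completeness, as a percentage."""
    return float(df.notna().mean().mean() * 100)


def clean(df):
    """Assumed cleaning logic: drop rows with missing values, then duplicate rows."""
    return df.dropna().drop_duplicates().reset_index(drop=True)


def clean_if_needed(df, threshold=THRESHOLD):
    """Run the cleaning step only when the score falls below the threshold."""
    if completeness_score(df) < threshold:
        return clean(df), True
    return df, False


# Assumed sample data: one missing name and one duplicated row.
df = pd.DataFrame({
    "Name": ["Alice", "Bob", None, "Bob"],
    "Age": [25, 30, 41, 30],
})

cleaned, triggered = clean_if_needed(df)
print(f"Cleaning triggered: {triggered}")
print(cleaned)
```

Here the score is 87.5%, which is below the assumed 90% threshold, so cleaning runs and leaves two complete, distinct rows. Dropping rows is the bluntest option; imputing or flagging values may suit some datasets better.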