### Task 1: Understanding and Defining Data Quality Metrics
**Description**: Learn how to define basic data quality metrics such as completeness, validity, and uniqueness for a simple dataset.

**Steps**:
1. Dataset: Use a CSV with columns such as `Name`, `Email`, and `Age`.
2. Metric Definitions:
    - Completeness: Fraction of non-null values in a column.
    - Validity: Fraction of `Email` values that contain `@`.
    - Uniqueness: Count of distinct entries in the `Email` column.

In [None]:
# Write your code from here
import pandas as pd

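def calculate_completeness(series):
    # Fraction of non-null values; empty strings still count as present.
    return series.count() / len(series) if len(series) > 0 else 0.0

def calculate_validity_email(series):
    # Fraction of values containing a literal '@'; nulls become 'None'/'nan' and fail.
    valid_count = series.astype(str).str.contains('@', regex=False).sum()
    return valid_count / len(series) if len(series) > 0 else 0.0

def calculate_uniqueness(series):
    # Number of distinct non-null values (a count, not a ratio).
    return series.nunique()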

data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', None],
        'Email': ['alice@example.com', 'bob@example.com', 'charlie@example.com', 'david.example', 'eve@example.com', ''],
        'Age': [25, 30, 22, None, 28, 31]}
df = pd.DataFrame(data)

completeness_name = calculate_completeness(df['Name'])
completeness_email = calculate_completeness(df['Email'])
completeness_age = calculate_completeness(df['Age'])

validity_email = calculate_validity_email(df['Email'])

uniqueness_email = calculate_uniqueness(df['Email'])

print(f"Completeness - Name: {completeness_name:.2f}")
print(f"Completeness - Email: {completeness_email:.2f}")
print(f"Completeness - Age: {completeness_age:.2f}")
print(f"Validity - Email: {validity_email:.2f}")
print(f"Uniqueness - Email: {uniqueness_email}")
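
The definitions above are deliberately loose: any string containing `@` passes the validity check, and the empty string in the last `Email` row counts as present. A stricter variant could look like the following sketch. It is written in plain Python so it can be run on its own; the helper names (`is_missing`, `looks_like_email`) and the regular expression are illustrative assumptions, not a complete email validator.

```python
import re

# Illustrative pattern (an assumption, not an RFC-compliant validator):
# text@text.text with no whitespace and exactly one '@'.
EMAIL_PATTERN = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")

def is_missing(value):
    """Treat None and blank or whitespace-only strings as missing."""
    return value is None or (isinstance(value, str) and value.strip() == "")

def looks_like_email(value):
    """True if value is a string that fully matches EMAIL_PATTERN."""
    return isinstance(value, str) and EMAIL_PATTERN.fullmatch(value) is not None

def strict_completeness(values):
    """Fraction of values that are neither None nor blank."""
    if not values:
        return 0.0
    return sum(not is_missing(v) for v in values) / len(values)

def strict_validity_email(values):
    """Fraction of values that look like an email address."""
    if not values:
        return 0.0
    return sum(looks_like_email(v) for v in values) / len(values)

# Same Email values as the sample data above.
emails = ['alice@example.com', 'bob@example.com', 'charlie@example.com',
          'david.example', 'eve@example.com', '']

print(f"Strict completeness - Email: {strict_completeness(emails):.2f}")
print(f"Strict validity - Email: {strict_validity_email(emails):.2f}")
```

With this variant the blank email lowers completeness for the column, which the `Series.count()` version does not report.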


### Task 2: Calculating Data Quality Score
**Description**: Aggregate multiple metrics to calculate an overall data quality score.

**Steps**:
1. Formula: Simple average of all metrics defined in Task 1.

In [None]:
# Write your code from here
import pandas as pd

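def calculate_completeness(series):
    # Fraction of non-null values; empty strings still count as present.
    return series.count() / len(series) if len(series) > 0 else 0.0

def calculate_validity_email(series):
    # Fraction of values containing a literal '@'; nulls become 'None'/'nan' and fail.
    valid_count = series.astype(str).str.contains('@', regex=False).sum()
    return valid_count / len(series) if len(series) > 0 else 0.0

def calculate_uniqueness(series):
    # Distinct non-null values divided by row count, so the result is on the
    # same 0-1 scale as the other metrics and can be averaged with them.
    return series.nunique() / len(series) if len(series) > 0 else 0.0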

data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', None],
        'Email': ['alice@example.com', 'bob@example.com', 'charlie@example.com', 'david.example', 'eve@example.com', ''],
        'Age': [25, 30, 22, None, 28, 31]}
df = pd.DataFrame(data)

completeness_name = calculate_completeness(df['Name'])
completeness_email = calculate_completeness(df['Email'])
completeness_age = calculate_completeness(df['Age'])

validity_email = calculate_validity_email(df['Email'])

uniqueness_email = calculate_uniqueness(df['Email'])

data_quality_score = (completeness_name + completeness_email + completeness_age + validity_email + uniqueness_email) / 5

print(f"Completeness - Name: {completeness_name:.2f}")
print(f"Completeness - Email: {completeness_email:.2f}")
print(f"Completeness - Age: {completeness_age:.2f}")
print(f"Validity - Email: {validity_email:.2f}")
print(f"Uniqueness - Email: {uniqueness_email:.2f}")
print(f"Overall Data Quality Score: {data_quality_score:.2f}")
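
The averaging step can be checked by hand for the six-row sample. The sketch below uses exact fractions; the counts are read off the sample data (5 non-null names, 6 non-null emails because the empty string is not null, 5 non-null ages, 4 emails containing `@`, and 6 distinct email values). The dictionary keys are illustrative names only.

```python
from fractions import Fraction

n_rows = 6

# Each metric as an exact fraction of the six sample rows.
metrics = {
    "completeness_name": Fraction(5, n_rows),
    "completeness_email": Fraction(6, n_rows),
    "completeness_age": Fraction(5, n_rows),
    "validity_email": Fraction(4, n_rows),
    "uniqueness_email": Fraction(6, n_rows),
}

# Simple (unweighted) average of the five metrics.
score = sum(metrics.values()) / len(metrics)

print(f"Overall Data Quality Score: {score} = {float(score):.2f}")
```

Because three of the five terms are completeness values, this simple average weights completeness more heavily than validity or uniqueness.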

### Task 3: Creating Expectations for a CSV
**Description**: Develop basic data quality expectations using Great Expectations.

**Steps**:
1. Create an Expectation Suite for the dataset.
2. Define completeness expectations, for example that `Name`, `Email`, and `Age` contain no null values.

In [None]:
# Write your code from here
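
The Great Expectations API differs considerably between releases, so the sketch below does not call the library. It only illustrates the idea behind the two steps in plain Python: a suite is a named collection of expectations, and a completeness expectation states that a column should contain no nulls. The suite name, the dictionary layout, and `make_not_null_expectation` are hypothetical; the string `expect_column_values_to_not_be_null` is used purely as a label, borrowed from the name of the corresponding Great Expectations expectation.

```python
# Plain-Python stand-in for an expectation suite.
# This does NOT use the Great Expectations library; every structure
# and name below is a hypothetical illustration.

def make_not_null_expectation(column):
    """Describe a completeness expectation for a single column."""
    return {
        "expectation_type": "expect_column_values_to_not_be_null",
        "column": column,
    }

suite = {
    "name": "contacts_completeness_suite",  # hypothetical suite name
    "expectations": [
        make_not_null_expectation(column)
        for column in ("Name", "Email", "Age")
    ],
}

for expectation in suite["expectations"]:
    print(f"{expectation['expectation_type']} -> {expectation['column']}")
```

When moving to the real library, check the documentation for the installed version, since the way suites and expectations are created has changed across releases.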

### Task 4: Running and Validating Expectations
**Description**: Run the created expectations and generate an output report.

**Steps**:
1. Validate the dataset against the Expectation Suite.
2. Generate an HTML report of the validation results.

In [None]:
# Write your code from here
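
As a library-independent illustration of the two steps, the sketch below checks each column for nulls and renders the outcome as a small HTML table. It is not the Great Expectations validation or Data Docs mechanism; `validate_not_null`, `render_html_report`, and the result fields are hypothetical names chosen for this example, and the rows mirror the sample data from Task 1.

```python
import html

# Sample rows mirroring the Task 1 data.
rows = [
    {"Name": "Alice", "Email": "alice@example.com", "Age": 25},
    {"Name": "Bob", "Email": "bob@example.com", "Age": 30},
    {"Name": "Charlie", "Email": "charlie@example.com", "Age": 22},
    {"Name": "David", "Email": "david.example", "Age": None},
    {"Name": "Eve", "Email": "eve@example.com", "Age": 28},
    {"Name": None, "Email": "", "Age": 31},
]

def validate_not_null(rows, column):
    """Check one column for nulls and return a small result record."""
    unexpected = sum(1 for row in rows if row.get(column) is None)
    return {
        "column": column,
        "success": unexpected == 0,
        "unexpected_count": unexpected,
        "element_count": len(rows),
    }

def render_html_report(results):
    """Render validation results as a minimal HTML table."""
    lines = ["<table>",
             "<tr><th>Column</th><th>Result</th><th>Null values</th></tr>"]
    for r in results:
        status = "PASS" if r["success"] else "FAIL"
        lines.append(
            f"<tr><td>{html.escape(r['column'])}</td><td>{status}</td>"
            f"<td>{r['unexpected_count']} of {r['element_count']}</td></tr>"
        )
    lines.append("</table>")
    return "\n".join(lines)

results = [validate_not_null(rows, c) for c in ("Name", "Email", "Age")]
report = render_html_report(results)
print(report)
```

The returned string can be written to a file and opened in a browser. Note that `Email` passes here because the empty string is not a null value.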


### Task 5: Automating Data Quality Score Calculation
**Description**: Automate the data quality score calculation with a script that integrates with Great Expectations.

In [None]:
# Write your code from here
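
One way to structure such a script is to reduce the per-column results to a single number with one function call. The sketch below is a plain-Python stand-in and does not integrate with Great Expectations; `not_null_fraction` and `data_quality_score` are hypothetical helpers, and the score is assumed to be the simple average of per-column non-null fractions, following the Task 2 formula.

```python
# Sample rows mirroring the Task 1 data.
rows = [
    {"Name": "Alice", "Email": "alice@example.com", "Age": 25},
    {"Name": "Bob", "Email": "bob@example.com", "Age": 30},
    {"Name": "Charlie", "Email": "charlie@example.com", "Age": 22},
    {"Name": "David", "Email": "david.example", "Age": None},
    {"Name": "Eve", "Email": "eve@example.com", "Age": 28},
    {"Name": None, "Email": "", "Age": 31},
]

def not_null_fraction(rows, column):
    """Fraction of rows whose value in `column` is not None."""
    if not rows:
        return 0.0
    return sum(1 for row in rows if row.get(column) is not None) / len(rows)

def data_quality_score(rows, columns):
    """Simple average of the non-null fraction over the given columns."""
    if not columns:
        return 0.0
    return sum(not_null_fraction(rows, c) for c in columns) / len(columns)

columns = ["Name", "Email", "Age"]
score = data_quality_score(rows, columns)
print(f"Automated Data Quality Score: {score:.2f}")
```

In a version that uses the real library, the per-column fractions would come from the validation results instead of `not_null_fraction`.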


### Task 6: Leveraging Data Quality Metrics for Automated Data Cleaning
**Description**: Implement a system in which automated data cleaning scripts are triggered when data quality metrics fall below a threshold.

**Steps**:
1. Define Cleaning Logic
2. Integrate with Great Expectations:
    - Add an action to the Great Expectations action list that triggers only when the quality score is below the threshold, so that cleaning runs automatically.

In [None]:
# Write your code from here
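
The threshold logic itself can be sketched without the library. In the example below, the threshold value of 0.9, the cleaning rule (drop rows with a missing value), and the names `clean_rows` and `run_quality_gate` are all assumptions made for illustration; it does not implement a Great Expectations action.

```python
# Sample rows mirroring the Task 1 data.
rows = [
    {"Name": "Alice", "Email": "alice@example.com", "Age": 25},
    {"Name": "Bob", "Email": "bob@example.com", "Age": 30},
    {"Name": "Charlie", "Email": "charlie@example.com", "Age": 22},
    {"Name": "David", "Email": "david.example", "Age": None},
    {"Name": "Eve", "Email": "eve@example.com", "Age": 28},
    {"Name": None, "Email": "", "Age": 31},
]
columns = ["Name", "Email", "Age"]
THRESHOLD = 0.9  # hypothetical minimum acceptable score

def data_quality_score(rows, columns):
    """Average non-null fraction over the given columns."""
    if not rows or not columns:
        return 0.0
    fractions = [
        sum(1 for row in rows if row.get(c) is not None) / len(rows)
        for c in columns
    ]
    return sum(fractions) / len(fractions)

def clean_rows(rows, columns):
    """Example cleaning rule: drop rows with None in any checked column."""
    return [row for row in rows
            if all(row.get(c) is not None for c in columns)]

def run_quality_gate(rows, columns, threshold):
    """Clean the data only if the score falls below the threshold."""
    score = data_quality_score(rows, columns)
    if score < threshold:
        cleaned = clean_rows(rows, columns)
        return cleaned, data_quality_score(cleaned, columns), True
    return rows, score, False

cleaned, new_score, triggered = run_quality_gate(rows, columns, THRESHOLD)
print(f"Cleaning triggered: {triggered}")
print(f"Rows kept: {len(cleaned)} of {len(rows)}")
print(f"Score after gate: {new_score:.2f}")
```

Dropping rows is only one possible cleaning rule; imputing values or flagging rows for review would fit the same gate.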
