### Task 1: Understanding and Defining Data Quality Metrics
**Description**: Learn how to define basic data quality metrics such as completeness, validity, and uniqueness for a simple dataset.

**Steps**:
1. Dataset: Use a CSV with columns such as `Name`, `Email`, and `Age`.
2. Metric Definitions:
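    - Completeness: Percentage of non-null values in each column.
    - Validity: Percentage of `Email` values that contain `@`.
    - Uniqueness: Number of distinct entries in the `Email` column (divide by the number of non-null entries to express it as a percentage).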
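
The three definitions above can be sketched with pandas. The function names `completeness`, `validity_email`, and `uniqueness_email` are borrowed from the test cell later in this notebook; the bodies are an assumption, not a reference solution. Validity and uniqueness are computed over non-null emails only, and uniqueness is returned as a percentage, because that is what those tests expect. The in-memory DataFrame stands in for the CSV.

```python
import pandas as pd


def completeness(df: pd.DataFrame) -> pd.Series:
    """Percentage of non-null values, one entry per column."""
    return df.notna().mean() * 100


def validity_email(df: pd.DataFrame, column: str = "Email") -> float:
    """Percentage of non-null emails that contain '@'."""
    emails = df[column].dropna().astype(str)
    if emails.empty:
        return 0.0
    return float(emails.str.contains("@", regex=False).mean() * 100)


def uniqueness_email(df: pd.DataFrame, column: str = "Email") -> float:
    """Distinct non-null emails as a percentage of all non-null emails."""
    emails = df[column].dropna()
    if emails.empty:
        return 0.0
    return float(emails.nunique() / len(emails) * 100)


# Small stand-in for the CSV (same shape as the test fixture in this notebook).
df = pd.DataFrame({
    "Name": ["Alice", None, "Bob"],
    "Email": ["alice@example.com", "invalid_email", None],
    "Age": [25, None, 30],
})

print(completeness(df))
print(validity_email(df))
print(uniqueness_email(df))
```

A check for `@` is a very loose notion of validity; a stricter pattern could be swapped in without changing the rest of the sketch.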

In [None]:
# Write your code from here

### Task 2: Calculating Data Quality Score
**Description**: Aggregate multiple metrics to calculate an overall data quality score.

**Steps**:
1. Formula: Take the simple (unweighted) average of the metrics defined in Task 1, with each metric expressed as a percentage.
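
A minimal sketch of that average, assuming every metric has already been reduced to a single percentage (for example, per-column completeness averaged into one number). The helper name `quality_score` and the metric values are illustrative, not results from a real dataset.

```python
from statistics import mean


def quality_score(metrics: dict) -> float:
    """Unweighted mean of metric percentages (each in the range 0-100)."""
    if not metrics:
        raise ValueError("at least one metric is required")
    return mean(metrics.values())


# Illustrative values only.
metrics = {"completeness": 75.0, "validity": 50.0, "uniqueness": 100.0}
score = quality_score(metrics)
print(f"Data quality score: {score:.1f}")
```

An unweighted mean treats all metrics as equally important; weights could be added if, say, validity matters more than uniqueness.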

In [None]:
# Write your code from here

### Task 3: Creating Expectations for a CSV
**Description**: Develop basic data quality expectations using Great Expectations.

**Steps**:
1. Create an Expectation Suite.
2. Define expectations for completeness (for example, that required columns contain no nulls).
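
The Great Expectations Python API differs noticeably between major releases, so the sketch below does not call it. It only shows, in plain pandas, what a completeness expectation checks. `expect_not_null` and `run_suite` are hypothetical helpers, not Great Expectations functions; the `mostly` parameter imitates the tolerance of the same name on the library's `expect_column_values_to_not_be_null` expectation.

```python
import pandas as pd


def expect_not_null(df: pd.DataFrame, column: str, mostly: float = 1.0) -> dict:
    """Hypothetical helper: pass if at least `mostly` of the values are non-null."""
    non_null_fraction = float(df[column].notna().mean())
    return {
        "expectation": f"not_null({column})",
        "success": non_null_fraction >= mostly,
        "unexpected_count": int(df[column].isna().sum()),
    }


def run_suite(df: pd.DataFrame, suite: list) -> list:
    """Evaluate every (column, mostly) pair in the suite."""
    return [expect_not_null(df, column, mostly) for column, mostly in suite]


# In this sketch a "suite" is just a list of (column, mostly) pairs.
suite = [("Name", 1.0), ("Email", 0.5), ("Age", 0.5)]

df = pd.DataFrame({
    "Name": ["Alice", None, "Bob"],
    "Email": ["alice@example.com", "invalid_email", None],
    "Age": [25, None, 30],
})

results = run_suite(df, suite)
for result in results:
    print(result)
```

In a real solution the suite would hold the library's own not-null expectations, created with whichever API your installed version provides.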

In [None]:
# Write your code from here

### Task 4: Running and Validating Expectations
**Description**: Run the created expectations and generate an output report.

**Steps**:
1. Validate the dataset against the Expectation Suite.
2. Generate an HTML report of the validation results.
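
Great Expectations publishes its HTML output as Data Docs. The sketch below does not use that feature; it is a library-free stand-in that turns a list of validation results into a small HTML page with pandas. The result dictionaries, the helper `build_report`, and the file name are all assumptions made for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Illustrative validation results; in practice these come from running the suite.
results = [
    {"expectation": "not_null(Name)", "success": False, "unexpected_count": 1},
    {"expectation": "not_null(Email)", "success": True, "unexpected_count": 0},
]


def build_report(results: list, title: str = "Data Quality Report") -> str:
    """Render the results as a minimal HTML page with a summary line."""
    table = pd.DataFrame(results).to_html(index=False)
    passed = sum(r["success"] for r in results)
    summary = f"{passed} of {len(results)} expectations passed"
    return f"<html><body><h1>{title}</h1><p>{summary}</p>{table}</body></html>"


html = build_report(results)

# Written to the temp directory here; any writable path would do.
report_path = Path(tempfile.gettempdir()) / "data_quality_report.html"
report_path.write_text(html, encoding="utf-8")
print(f"Report written to {report_path}")
```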

In [None]:
# Write your code from here


### Task 5: Automating Data Quality Score Calculation
**Description**: Automate the data quality score calculation with a script that integrates with Great Expectations.

In [8]:
# Write your code from here

! pip install pytest


Defaulting to user installation because normal site-packages is not writeable
Collecting pytest
  Downloading pytest-8.3.5-py3-none-any.whl (343 kB)
Collecting tomli>=1
  Downloading tomli-2.2.1-py3-none-any.whl (14 kB)
Collecting pluggy<2,>=1.5
  Downloading pluggy-1.6.0-py3-none-any.whl (20 kB)
Collecting iniconfig
  Downloading iniconfig-2.1.0-py3-none-any.whl (6.0 kB)
Installing collected packages: tomli, pluggy, iniconfig, pytest
Successfully installed iniconfig-2.1.0 pluggy-1.6.0 pytest-8.3.5 tomli-2.2.1

[notice] A new release of pip is available: 23.0.1 -> 25.1.1
[notice] To update, run: pip install --upgrade pip


### Task 6: Leveraging Data Quality Metrics for Automated Data Cleaning
**Description**: Implement a system that triggers automated data cleaning scripts when data quality metrics fall below a threshold.

**Steps**:
1. Define Cleaning Logic
2. Integrate with Great Expectations:
    - Use an action within the Great Expectations action list that triggers only if the quality score is below a threshold, so that cleaning runs automatically.
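
The threshold logic can be sketched independently of Great Expectations. In the code below, `clean_data` follows the behaviour the tests in this notebook expect (rows with a missing or invalid email are dropped, missing names are filled). The `"Unknown"` placeholder, the default threshold of 80, and the wrapper `clean_if_below` are assumptions. Wiring this into a Great Expectations action list is not shown.

```python
import pandas as pd


def clean_data(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows whose Email is missing or lacks '@', then fill missing names."""
    has_at = df["Email"].fillna("").astype(str).str.contains("@", regex=False)
    cleaned = df[has_at].copy()
    cleaned["Name"] = cleaned["Name"].fillna("Unknown")  # assumed placeholder
    return cleaned


def clean_if_below(df: pd.DataFrame, score: float, threshold: float = 80.0):
    """Return (dataframe, was_cleaned); clean only when score < threshold."""
    if score < threshold:
        return clean_data(df), True
    return df, False


df = pd.DataFrame({
    "Name": ["Alice", None, "Bob"],
    "Email": ["alice@example.com", "invalid_email", None],
    "Age": [25, None, 30],
})

# A score of 60 is below the assumed threshold, so cleaning runs.
cleaned, was_cleaned = clean_if_below(df, score=60.0)
print(was_cleaned)
print(cleaned)
```

Returning a flag alongside the DataFrame makes it easy to log or report whether cleaning was actually triggered.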

In [9]:
# Write your code from here
import pytest
import pandas as pd
from your_module import completeness, validity_email, uniqueness_email, clean_data  # Replace 'your_module' with your script name without .py

@pytest.fixture
def sample_df():
    return pd.DataFrame({
        "Name": ["Alice", None, "Bob"],
        "Email": ["alice@example.com", "invalid_email", None],
        "Age": [25, None, 30]
    })

def test_completeness(sample_df):
    comp = completeness(sample_df)
    assert comp["Name"] == pytest.approx(2/3 * 100)
    assert comp["Email"] == pytest.approx(2/3 * 100)
    assert comp["Age"] == pytest.approx(2/3 * 100)

def test_validity_email(sample_df):
    val = validity_email(sample_df)
    # Only one valid email 'alice@example.com' out of 2 non-null emails
    assert val == pytest.approx(0.5 * 100)

def test_uniqueness_email(sample_df):
    uniq = uniqueness_email(sample_df)
    # Two non-null emails, both unique
    assert uniq == 100

def test_clean_data(sample_df):
    cleaned = clean_data(sample_df)
    # After cleaning, invalid email row removed, missing name filled
    assert cleaned["Name"].isnull().sum() == 0
    assert all(cleaned["Email"].apply(lambda x: "@" in x))
    assert cleaned.shape[0] == 1  # Only one valid email row remains


ModuleNotFoundError: No module named 'your_module'