### Task 1: Understanding and Defining Data Quality Metrics
**Description**: Learn how to define basic data quality metrics such as completeness, validity, and uniqueness for a simple dataset.

**Steps**:
1. Dataset: Use a CSV with columns such as Name, Email, and Age.
2. Metric Definitions:
    - Completeness: Fraction of non-null values in a column (empty strings still count as present).
    - Validity: Fraction of Email values containing the @ character.
    - Uniqueness: Number of distinct entries in the Email column.

In [1]:
# Write your code from here
import pandas as pd

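def calculate_completeness(series):
    # Fraction of non-null entries; count() skips None/NaN but keeps empty strings.
    return series.count() / len(series) if len(series) > 0 else 0.0

def calculate_validity_email(series):
    # Fraction of entries containing a literal '@'; nulls are treated as invalid.
    valid_count = series.fillna('').astype(str).str.contains('@', regex=False).sum()
    return valid_count / len(series) if len(series) > 0 else 0.0

def calculate_uniqueness(series):
    # Number of distinct non-null values (a raw count, not a ratio).
    return series.nunique()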

data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', None],
        'Email': ['alice@example.com', 'bob@example.com', 'charlie@example.com', 'david.example', 'eve@example.com', ''],
        'Age': [25, 30, 22, None, 28, 31]}
df = pd.DataFrame(data)

completeness_name = calculate_completeness(df['Name'])
completeness_email = calculate_completeness(df['Email'])
completeness_age = calculate_completeness(df['Age'])

validity_email = calculate_validity_email(df['Email'])

uniqueness_email = calculate_uniqueness(df['Email'])

print(f"Completeness - Name: {completeness_name:.2f}")
print(f"Completeness - Email: {completeness_email:.2f}")
print(f"Completeness - Age: {completeness_age:.2f}")
print(f"Validity - Email: {validity_email:.2f}")
print(f"Uniqueness - Email: {uniqueness_email}")


Completeness - Name: 0.83
Completeness - Email: 1.00
Completeness - Age: 0.83
Validity - Email: 0.67
Uniqueness - Email: 6
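
Email completeness is reported as 1.00 even though one entry is an empty string, because `count()` only excludes None/NaN. If blank strings should count as missing, one option is to flag them explicitly. This is a minimal sketch under that assumption; the helper name `calculate_completeness_strict` is hypothetical and not part of the original exercise.

```python
import pandas as pd

def calculate_completeness_strict(series):
    """Fraction of entries that are neither null nor blank strings."""
    if len(series) == 0:
        return 0.0
    # A blank entry is a string that is empty after stripping whitespace.
    is_blank = series.map(lambda v: isinstance(v, str) and v.strip() == '')
    is_missing = series.isna() | is_blank
    return float((~is_missing).sum() / len(series))

emails = pd.Series(['alice@example.com', 'bob@example.com', 'charlie@example.com',
                    'david.example', 'eve@example.com', ''])
strict_completeness_email = calculate_completeness_strict(emails)
print(f"Strict completeness - Email: {strict_completeness_email:.2f}")
```

Under this stricter definition the Email column has five usable entries out of six, so it scores the same as Name and Age.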


### Task 2: Calculating Data Quality Score
**Description**: Aggregate multiple metrics to calculate an overall data quality score.

**Steps**:
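1. Formula: Simple average of the metrics defined in Task 1, with uniqueness expressed as a ratio (distinct values / total rows) so that every metric lies between 0 and 1.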

In [2]:
# Write your code from here
import pandas as pd

def calculate_completeness(series):
    return series.count() / len(series) if len(series) > 0 else 0.0

def calculate_validity_email(series):
    valid_count = series.astype(str).str.contains('@').sum()
    return valid_count / len(series) if len(series) > 0 else 0.0

def calculate_uniqueness(series):
    # Ratio of distinct values to total rows, so the metric is on the same 0-1 scale as the others.
    return series.nunique() / len(series) if len(series) > 0 else 0.0

data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', None],
        'Email': ['alice@example.com', 'bob@example.com', 'charlie@example.com', 'david.example', 'eve@example.com', ''],
        'Age': [25, 30, 22, None, 28, 31]}
df = pd.DataFrame(data)

completeness_name = calculate_completeness(df['Name'])
completeness_email = calculate_completeness(df['Email'])
completeness_age = calculate_completeness(df['Age'])

validity_email = calculate_validity_email(df['Email'])

uniqueness_email = calculate_uniqueness(df['Email'])

data_quality_score = (completeness_name + completeness_email + completeness_age + validity_email + uniqueness_email) / 5

print(f"Completeness - Name: {completeness_name:.2f}")
print(f"Completeness - Email: {completeness_email:.2f}")
print(f"Completeness - Age: {completeness_age:.2f}")
print(f"Validity - Email: {validity_email:.2f}")
print(f"Uniqueness - Email: {uniqueness_email:.2f}")
print(f"Overall Data Quality Score: {data_quality_score:.2f}")

Completeness - Name: 0.83
Completeness - Email: 1.00
Completeness - Age: 0.83
Validity - Email: 0.67
Uniqueness - Email: 1.00
Overall Data Quality Score: 0.87
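
The reported score can be checked by hand. Each metric is a fraction over the six rows (5/6 for Name completeness, 6/6 for Email completeness, 5/6 for Age completeness, 4/6 for Email validity, 6/6 for Email uniqueness), so the simple average works out as:

```latex
\[
\text{score}
= \frac{1}{5}\left(\frac{5}{6} + \frac{6}{6} + \frac{5}{6} + \frac{4}{6} + \frac{6}{6}\right)
= \frac{1}{5}\cdot\frac{26}{6}
= \frac{13}{15} \approx 0.87
\]
```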


### Task 3: Creating Expectations for a CSV
**Description**: Develop basic data quality expectations using Great Expectations.

**Steps**:
1. Create an Expectation Suite
2. Define Expectations for Completeness (no null values in each column)

In [3]:
# Write your code from here
# Assumes a legacy (pre-1.0, roughly 0.15.x) great_expectations release; later versions changed this API.
import pandas as pd
from great_expectations.core.batch import RuntimeBatchRequest
from great_expectations.core.expectation_configuration import ExpectationConfiguration
from great_expectations.data_context import BaseDataContext
from great_expectations.data_context.types.base import DataContextConfig, InMemoryStoreBackendDefaults

data = {
    'id': [1, 2, 3, 4, 5],
    'product_name': ['Laptop', 'Mouse', 'Keyboard', None, 'Monitor'],
    'price': [1200.50, 25.00, 75.20, 150.00, 300.00],
    'category': ['Electronics', 'Electronics', 'Electronics', 'Peripherals', 'Electronics'],
    'quantity_sold': [100, 500, 200, 300, None],
}
df = pd.DataFrame(data)

# Build an in-memory context so no great_expectations project directory is required.
context = BaseDataContext(project_config=DataContextConfig(store_backend_defaults=InMemoryStoreBackendDefaults()))
context.add_datasource(
    name="my_in_memory_datasource",
    class_name="Datasource",
    execution_engine={"class_name": "PandasExecutionEngine"},
    data_connectors={"runtime_data_connector": {"class_name": "RuntimeDataConnector", "batch_identifiers": ["batch_id"]}},
)

expectation_suite_name = "product_data_completeness_suite"
suite = context.create_expectation_suite(expectation_suite_name, overwrite_existing=True)
# Completeness: every column is expected to contain no nulls.
for column in df.columns:
    suite.add_expectation(
        ExpectationConfiguration(
            expectation_type="expect_column_values_to_not_be_null",
            kwargs={"column": column},  # the column name must be passed inside kwargs
        )
    )
context.save_expectation_suite(suite, expectation_suite_name)

# Wrap the in-memory DataFrame as a runtime batch and validate it against the suite.
batch_request = RuntimeBatchRequest(
    datasource_name="my_in_memory_datasource",
    data_connector_name="runtime_data_connector",
    data_asset_name="my_data_asset",
    runtime_parameters={"batch_data": df},
    batch_identifiers={"batch_id": "my_first_batch"},
)
validator = context.get_validator(batch_request=batch_request, expectation_suite_name=expectation_suite_name)
validation_result = validator.validate()
print(validation_result.to_json_dict())

ModuleNotFoundError: No module named 'great_expectations'

### Task 4: Running and Validating Expectations
**Description**: Run the created expectations and generate an output report.

**Steps**:
1. Validate
2. Generate HTML Report

In [None]:
# Write your code from here
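
The Task 3 cell failed with a ModuleNotFoundError, so no Great Expectations output is available to validate here. As a stand-in, the sketch below uses only pandas to check the same not-null expectations and write a small HTML report with `DataFrame.to_html`. It is not the Great Expectations report format; the function name `validate_not_null` and the file name `validation_report.html` are hypothetical.

```python
import pandas as pd

def validate_not_null(df):
    """Check each column for nulls and return one result row per column."""
    rows = []
    for column in df.columns:
        null_count = int(df[column].isna().sum())
        rows.append({
            'column': column,
            'expectation': 'values are not null',
            'null_count': null_count,
            'success': null_count == 0,
        })
    return pd.DataFrame(rows)

df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'product_name': ['Laptop', 'Mouse', 'Keyboard', None, 'Monitor'],
    'price': [1200.50, 25.00, 75.20, 150.00, 300.00],
    'category': ['Electronics', 'Electronics', 'Electronics', 'Peripherals', 'Electronics'],
    'quantity_sold': [100, 500, 200, 300, None],
})

report = validate_not_null(df)
print(report)

# Write the result table to a standalone HTML file (hypothetical file name).
with open('validation_report.html', 'w', encoding='utf-8') as f:
    f.write('<h1>Completeness validation</h1>\n' + report.to_html(index=False))
```

With Great Expectations installed, legacy releases provide `context.build_data_docs()` for rendering stored validation results as HTML; that path is not exercised in this sketch.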


### Task 5: Automating Data Quality Score Calculation
**Description**: Automate the data quality score via a script that integrates with Great
Expectations.

In [None]:
# Write your code from here
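
One way to automate the score is to derive it from a validation result instead of recomputing each metric by hand. The sketch below assumes the dictionary shape produced by `validation_result.to_json_dict()` in legacy Great Expectations releases: a `results` list whose items carry `success` and `result.unexpected_percent`. The sample dictionary is hand-built for illustration (it is not real output), and the function name `quality_score_from_validation` is hypothetical.

```python
def quality_score_from_validation(validation_dict):
    """Average per-expectation pass rate, on a 0-1 scale."""
    pass_rates = []
    for item in validation_dict.get('results', []):
        unexpected_percent = item.get('result', {}).get('unexpected_percent')
        if unexpected_percent is None:
            # Fall back to the boolean outcome when no percentage is reported.
            pass_rates.append(1.0 if item.get('success') else 0.0)
        else:
            pass_rates.append(1.0 - unexpected_percent / 100.0)
    return sum(pass_rates) / len(pass_rates) if pass_rates else 0.0

# Hand-built sample: five not-null checks, two of which have 20% nulls.
sample_validation = {
    'results': [
        {'success': True, 'result': {'unexpected_percent': 0.0}},
        {'success': False, 'result': {'unexpected_percent': 20.0}},
        {'success': True, 'result': {'unexpected_percent': 0.0}},
        {'success': True, 'result': {'unexpected_percent': 0.0}},
        {'success': False, 'result': {'unexpected_percent': 20.0}},
    ]
}

score = quality_score_from_validation(sample_validation)
print(f"Overall Data Quality Score: {score:.2f}")
```

Averaging pass rates rather than counting passed expectations keeps the score sensitive to how badly a column fails, which matches the ratio-based metrics used in Task 2.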


### Task 6: Leveraging Data Quality Metrics for Automated Data Cleaning
**Description**: Implement a system where if data quality metrics fall below a threshold,
automated data cleaning scripts are triggered.

**Steps**:
1. Define Cleaning Logic
2. Integrate with Great Expectations:
    - Add an action to the Great Expectations action list that runs the cleaning logic only when the quality score falls below a threshold.

In [None]:
# Write your code from here
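
The threshold logic itself can be sketched without Great Expectations. The example below computes a completeness-based score with pandas and runs a cleaning function only when the score is below a threshold. The threshold value, the function names, and the specific cleaning rules (drop rows without a product name, fill missing quantities with the median) are all assumptions chosen for illustration.

```python
import pandas as pd

QUALITY_THRESHOLD = 0.95  # hypothetical threshold

def completeness_score(df):
    """Mean fraction of non-null cells across all columns."""
    return float(df.notna().mean().mean()) if len(df) > 0 else 0.0

def clean(df):
    """Illustrative cleaning: drop rows lacking a name, fill quantity gaps with the median."""
    cleaned = df.dropna(subset=['product_name']).copy()
    median_quantity = cleaned['quantity_sold'].median()
    cleaned['quantity_sold'] = cleaned['quantity_sold'].fillna(median_quantity)
    return cleaned

def clean_if_below_threshold(df, threshold=QUALITY_THRESHOLD):
    """Return (dataframe, triggered); cleaning runs only if the score is below threshold."""
    if completeness_score(df) < threshold:
        return clean(df), True
    return df, False

df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'product_name': ['Laptop', 'Mouse', 'Keyboard', None, 'Monitor'],
    'price': [1200.50, 25.00, 75.20, 150.00, 300.00],
    'category': ['Electronics', 'Electronics', 'Electronics', 'Peripherals', 'Electronics'],
    'quantity_sold': [100, 500, 200, 300, None],
})

cleaned_df, triggered = clean_if_below_threshold(df)
print(f"Score before: {completeness_score(df):.2f}, cleaning triggered: {triggered}")
print(f"Score after: {completeness_score(cleaned_df):.2f}, rows kept: {len(cleaned_df)}")
```

Wiring this into a Great Expectations action list would require a custom action class, which is not shown here.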
