### Task 1: Understanding and Defining Data Quality Metrics
**Description**: Learn how to define basic data quality metrics such as completeness, validity, and uniqueness for a simple dataset.

**Steps**:
1. Dataset: Use a CSV with columns like `Name`, `Email`, `Age`.
2. Metric Definitions:
    - Completeness: Percentage of non-null values in each column.
    - Validity: Percentage of email fields containing `@`.
    - Uniqueness: Count of distinct entries in the `Email` column.

In [44]:
# Write your code from here
import pandas as pd

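# Create a sample dataset with Name, Email, and Age.
# 'bob@example' has no top-level domain and the last email is empty;
# the empty string is read back from the CSV as NaN (a missing value).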
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Email': ['alice@example.com', 'bob@example', 'charlie@example.com', 'david@example.com', ''],
    'Age': [30, 25, 35, 40, 50]
}

# Convert the dataset into a pandas DataFrame
df = pd.DataFrame(data)

# Write the DataFrame to a CSV file
df.to_csv('students.csv', index=False)

print("CSV file 'students.csv' has been created successfully.")


CSV file 'students.csv' has been created successfully.


In [45]:
import pandas as pd

# Load dataset from CSV
df = pd.read_csv('students.csv')

# 1. Completeness: Percentage of non-null values
def calculate_completeness(df):
    completeness = (df.notnull().mean()) * 100
    return completeness

# 2. Validity: % of email fields containing '@'
def calculate_validity(df):
    valid_email = df['Email'].str.contains('@', na=False).mean() * 100
    return valid_email

# 3. Uniqueness: Count distinct entries in the Email column
def calculate_uniqueness(df):
    unique_emails = df['Email'].nunique()
    return unique_emails

# Calculate metrics
completeness = calculate_completeness(df)
validity = calculate_validity(df)
uniqueness = calculate_uniqueness(df)

# Display results
print("Data Quality Metrics:")
print(f"Completeness: {completeness}")
print(f"Validity: {validity}%")
print(f"Uniqueness: {uniqueness} unique email addresses")


Data Quality Metrics:
Completeness: Name     100.0
Email     80.0
Age      100.0
dtype: float64
Validity: 80.0%
Uniqueness: 4 unique email addresses


### Task 2: Calculating Data Quality Score
**Description**: Aggregate multiple metrics to calculate an overall data quality score.

**Steps**:
1. Formula: Simple average of the metrics defined in Task 1, each expressed as a percentage (uniqueness becomes distinct emails divided by the number of rows).

In [46]:
# Write your code from here
import pandas as pd

# Load the dataset
df = pd.read_csv('students.csv')

# Metric 1: Completeness (Percentage of non-null values)
def completeness(df):
    total_values = df.size
    non_null_values = df.notnull().sum().sum()
    return (non_null_values / total_values) * 100

# Metric 2: Validity (Percentage of valid emails containing '@')
def validity(df):
    valid_emails = df['Email'].apply(lambda x: '@' in str(x)).sum()
    return (valid_emails / len(df)) * 100

# Metric 3: Uniqueness (Percentage of unique emails)
def uniqueness(df):
    unique_emails = df['Email'].nunique()
    return (unique_emails / len(df)) * 100

# Calculate the overall Data Quality Score (Simple Average of the metrics)
def calculate_data_quality_score(df):
    completeness_score = completeness(df)
    validity_score = validity(df)
    uniqueness_score = uniqueness(df)
    
    # Simple average of all metrics
    overall_dqi = (completeness_score + validity_score + uniqueness_score) / 3
    return overall_dqi

# Calculate and print the Data Quality Score
data_quality_score = calculate_data_quality_score(df)
print(f"Overall Data Quality Score: {data_quality_score:.2f}")


Overall Data Quality Score: 84.44
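
The score above can be checked by hand. This is a worked sketch for the five-row sample, assuming the empty email is read back as a missing value (14 of 15 cells filled, 4 of 5 emails containing `@`, 4 distinct emails over 5 rows):

```latex
\begin{aligned}
\text{completeness} &= \tfrac{14}{15} \times 100 \approx 93.33 \\
\text{validity}     &= \tfrac{4}{5} \times 100 = 80.00 \\
\text{uniqueness}   &= \tfrac{4}{5} \times 100 = 80.00 \\
\text{score}        &= \frac{93.33 + 80.00 + 80.00}{3} \approx 84.44
\end{aligned}
```

Note that completeness here is computed over all cells at once, whereas Task 1 reports it per column.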


### Task 3: Creating Expectations for a CSV
**Description**: Develop basic data quality expectations using Great Expectations.

**Steps**:
1. Expectation Suite
2. Define Expectations for Completeness

In [47]:
# NOTE: this cell uses the legacy Dataset API (great_expectations.dataset),
# which the installed Great Expectations release no longer ships (see the error below).
import pandas as pd
from great_expectations.dataset import PandasDataset

# Load the dataset using pandas
df = pd.read_csv('students.csv')

# Wrap the DataFrame so that expectation methods are available on it
df_ge = PandasDataset(df)

# Completeness: Email and Age must not be null
df_ge.expect_column_values_to_not_be_null("Email")
df_ge.expect_column_values_to_not_be_null("Age")
df_ge.expect_column_value_lengths_to_be_between("Email", min_value=5, max_value=100)

# Validity: emails should match a basic address pattern (regular expression)
df_ge.expect_column_values_to_match_regex('Email', r'.+@.+\..+')

# Collect the expectations recorded above into a suite, keeping the failed ones too
suite = df_ge.get_expectation_suite(discard_failed_expectations=False)

# Validate the dataset against the suite
results = df_ge.validate(expectation_suite=suite)

# Print the results
print(results)


ModuleNotFoundError: No module named 'great_expectations.dataset'
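
Because the cell above fails on the installed release, the same checks can be approximated in plain pandas. The following is a minimal sketch, not the Great Expectations API: `check_not_null` and `check_matches_regex` are hypothetical helpers introduced here for illustration, and the sample rows are inlined so the example runs on its own.

```python
import io
import pandas as pd

# Same five rows as students.csv, inlined so the sketch is self-contained
CSV_TEXT = """Name,Email,Age
Alice,alice@example.com,30
Bob,bob@example,25
Charlie,charlie@example.com,35
David,david@example.com,40
Eve,,50
"""
df = pd.read_csv(io.StringIO(CSV_TEXT))

def check_not_null(df, column):
    """Hypothetical helper: passes only if the column has no missing values."""
    failed = int(df[column].isnull().sum())
    return {"check": f"{column} not null", "success": failed == 0, "failed_rows": failed}

def check_matches_regex(df, column, pattern):
    """Hypothetical helper: passes only if every non-null value matches the pattern."""
    values = df[column].dropna().astype(str)
    failed = int((~values.str.match(pattern)).sum())
    return {"check": f"{column} matches {pattern}", "success": failed == 0, "failed_rows": failed}

results = [
    check_not_null(df, "Email"),
    check_not_null(df, "Age"),
    check_matches_regex(df, "Email", r".+@.+\..+"),
]
for result in results:
    print(result)
```

On the sample data, the `Email` null check and the pattern check each report one failing row (the empty email and `bob@example`, respectively), while the `Age` check passes.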

### Task 4: Running and Validating Expectations
**Description**: Run the created expectations and generate an output report.

**Steps**:
1. Validate
2. Generate HTML Report

In [None]:
# Write your code from here
# NOTE: this cell targets the legacy Great Expectations API (DataContext and
# ge.dataset.PandasDataset); the installed release no longer provides it (see the error below).
import os
import pandas as pd
import great_expectations as ge
from great_expectations.data_context import DataContext

# Load the dataset using pandas
df = pd.read_csv('students.csv')

# Create a DataContext (requires an initialized Great Expectations project directory)
context = DataContext()

# Convert pandas DataFrame to Great Expectations PandasDataset
df_ge = ge.dataset.PandasDataset(df)

# Completeness: Email and Age must not be null
df_ge.expect_column_values_to_not_be_null("Email")
df_ge.expect_column_values_to_not_be_null("Age")
df_ge.expect_column_value_lengths_to_be_between("Email", min_value=5, max_value=100)

# Validity: emails should match a basic address pattern (regular expression)
df_ge.expect_column_values_to_match_regex('Email', r'.+@.+\..+')

# Collect the expectations recorded above into a suite, keeping the failed ones too
suite = df_ge.get_expectation_suite(discard_failed_expectations=False)

# Validate the dataset against the suite
results = df_ge.validate(expectation_suite=suite)

# Print the results
print(results)

# Generate the HTML report
context.build_data_docs()

# Assuming the data docs folder is generated in your context directory,
# the report can be opened from the data_docs folder
report_path = os.path.join(
    context.root_directory, "uncommitted", "data_docs", "local_site", "index.html"
)

print(f"HTML report generated at: {report_path}")



ImportError: cannot import name 'DataContext' from 'great_expectations.data_context' (/home/vscode/.local/lib/python3.10/site-packages/great_expectations/data_context/__init__.py)
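
Since the Data Docs step cannot run here, a basic HTML report can be produced with pandas alone. This is a rough sketch, not a replacement for Data Docs: the `check_results` list and the `dq_report.html` filename are assumptions made for illustration, with values corresponding to the sample dataset (one empty email, one email without a top-level domain).

```python
import pandas as pd

# Illustrative check outcomes for the sample data; in practice these would
# come from whatever validation step is actually run.
check_results = [
    {"check": "Email not null", "success": False, "failed_rows": 1},
    {"check": "Age not null", "success": True, "failed_rows": 0},
    {"check": "Email matches address pattern", "success": False, "failed_rows": 1},
]

# Render the outcomes as an HTML table and wrap it in a minimal page
report = pd.DataFrame(check_results)
html = (
    "<html><body><h1>Data Quality Report</h1>"
    + report.to_html(index=False)
    + "</body></html>"
)

# Hypothetical output path
with open("dq_report.html", "w", encoding="utf-8") as fh:
    fh.write(html)

print("HTML report written to dq_report.html")
```

Opening `dq_report.html` in a browser shows one table row per check.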

### Task 5: Automating Data Quality Score Calculation
**Description**: Automate the data quality score via a script that integrates with Great
Expectations.

In [None]:
import pandas as pd
import great_expectations as ge
from great_expectations.data_context import DataContext

# Load the dataset using pandas
df = pd.read_csv('students.csv')

# Initialize the Great Expectations DataContext
context = DataContext()

# Step 1: Define Cleaning Logic (example)
def clean_data(df):
    """
    Cleaning logic to fill missing values in Age with the mean.
    """
    # Assign the result back; calling fillna(inplace=True) on a selected column may not update df
    df['Age'] = df['Age'].fillna(df['Age'].mean())
    # Additional cleaning logic as per requirement
    return df

# Step 2: Create and Load Expectation Suite
# Create an expectation suite for data quality expectations (can be done using the CLI, assuming suite exists)
suite_name = "students_suite"
context.create_expectation_suite(suite_name, overwrite_existing=True)

# Define expectations for completeness and email validity
batch = ge.dataset.PandasDataset(df)

# Add expectations for the columns present in students.csv (the file has no 'Grade' column)
batch.expect_column_values_to_be_between('Age', min_value=0, max_value=120)  # Valid age range 0-120
batch.expect_column_values_to_match_regex('Email', r'.+@.+\..+')  # Valid email format

# Step 3: Calculate Data Quality Score
def calculate_dqi(df):
    completeness = df.notnull().mean().mean() * 100  # Completeness as percentage of non-null values
    valid_email = df['Email'].str.contains(r'.+@.+\..+', na=False).mean() * 100  # Valid email check; missing emails count as invalid
    uniqueness = df['Email'].nunique() / len(df) * 100  # Uniqueness as percentage of unique email entries
    
    # Simple average of all metrics for DQI
    dqi = (completeness + valid_email + uniqueness) / 3
    return dqi

# Step 4: Trigger Data Cleaning Based on DQI Threshold
def check_and_clean(df, threshold=85):
    dqi = calculate_dqi(df)
    print(f"Current Data Quality Index (DQI): {dqi}")
    
    if dqi < threshold:
        print(f"Data quality is below the threshold of {threshold}. Triggering cleaning process.")
        cleaned_df = clean_data(df)
        cleaned_dqi = calculate_dqi(cleaned_df)
        print(f"New Data Quality Index (DQI) after cleaning: {cleaned_dqi}")
        return cleaned_df
    else:
        print("Data quality is sufficient. No cleaning required.")
        return df

# Step 5: Run the Cleaning Logic
cleaned_df = check_and_clean(df)

# Optionally, you can save the cleaned dataframe to a new CSV
cleaned_df.to_csv('students_cleaned.csv', index=False)



ImportError: cannot import name 'DataContext' from 'great_expectations.data_context' (/home/vscode/.local/lib/python3.10/site-packages/great_expectations/data_context/__init__.py)

### Task 6: Leveraging Data Quality Metrics for Automated Data Cleaning
**Description**: Implement a system where if data quality metrics fall below a threshold,
automated data cleaning scripts are triggered.

**Steps**:
1. Define Cleaning Logic
2. Integrate with Great Expectations:
    - Use an action within the Great Expectations action list that only triggers if quality score is below a threshold, automating the cleaning.

In [None]:
# Write your code from here
import pandas as pd

# Define a function for removing rows with missing values
def clean_missing_values(df):
    print("Cleaning missing values...")
    return df.dropna()

# Define a function to replace invalid email formats with a placeholder address
def clean_invalid_emails(df):
    print("Cleaning invalid emails...")
    df = df.copy()  # work on a copy so the caller's DataFrame is not modified
    df['Email'] = df['Email'].apply(lambda x: x if '@' in str(x) else 'invalid_email@example.com')
    return df

# Define a function for removing outliers in the 'Age' column (e.g., age > 120 is considered an outlier)
def clean_outliers(df):
    print("Cleaning outliers in Age...")
    return df[df['Age'] <= 120]

# Define the main cleaning function that integrates all the cleaning logic
def clean_data(df):
    df = clean_missing_values(df)
    df = clean_invalid_emails(df)
    df = clean_outliers(df)
    return df
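
The cell above defines the cleaning logic but not the threshold trigger described in step 2. The following is a minimal sketch of that trigger in plain Python and pandas rather than a Great Expectations action: `quality_score`, `simple_clean`, and `clean_if_below` are hypothetical names, the threshold of 85 is an assumed default, and `simple_clean` is a simplified stand-in for the cleaning functions.

```python
import pandas as pd

def quality_score(df):
    """Average of completeness, email validity, and email uniqueness (all in percent)."""
    if len(df) == 0:
        return 0.0
    completeness = df.notnull().mean().mean() * 100
    validity = df["Email"].str.contains("@", na=False).mean() * 100
    uniqueness = df["Email"].nunique() / len(df) * 100
    return (completeness + validity + uniqueness) / 3

def simple_clean(df):
    """Stand-in cleaning step: drop rows with missing values or emails without '@'."""
    df = df.dropna()
    return df[df["Email"].str.contains("@")].reset_index(drop=True)

def clean_if_below(df, threshold=85.0):
    """Run the cleaning step only when the score falls below the threshold."""
    score = quality_score(df)
    if score < threshold:
        cleaned = simple_clean(df)
        return cleaned, score, quality_score(cleaned)
    return df, score, score

# Same rows as the sample dataset, with the empty email represented as None
sample = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David", "Eve"],
    "Email": ["alice@example.com", "bob@example", "charlie@example.com",
              "david@example.com", None],
    "Age": [30, 25, 35, 40, 50],
})

cleaned, before, after = clean_if_below(sample)
print(f"Score before: {before:.2f}, after: {after:.2f}, rows kept: {len(cleaned)}")
```

With the sample rows the score starts at about 84.44, which is below 85, so the cleaning step runs and drops the row with the missing email.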

