## Automate Data Quality Checks with Great Expectations
**Introduction**: In this activity, you will automate data quality checks with the Great Expectations framework: setting up a project, defining expectations, validating a dataset against them, and generating validation reports.

### Task 1: Setup and Initial Expectations

1. Objective: Set up Great Expectations and create initial expectations for a dataset.
2. Steps:
    - Install Great Expectations using pip.
    - Initialize a data context.
    - Create basic expectations on a sample dataset.
    - E.g., implement a basic setup and expectations for column presence and type.

In [None]:
# Write your code from here
!pip install great-expectations


Defaulting to user installation because normal site-packages is not writeable

[notice] A new release of pip is available: 23.0.1 -> 25.1.1
[notice] To update, run: pip install --upgrade pip


In [2]:
!great_expectations init


/bin/bash: line 1: great_expectations: command not found
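
The `command not found` error means the shell could not locate a `great_expectations` executable. Two causes are plausible, and this output confirms neither: the `--user` install reported by pip may have placed its scripts in a directory that is not on `PATH`, or the installed release may no longer ship the legacy CLI (Great Expectations 1.x is understood to have dropped it in favor of the Python API). The following standard-library sketch checks the first cause. `diagnose_cli` is a hypothetical helper name, and the `"<os.name>_user"` scheme lookup is an assumption that may not hold on every Python build.

```python
import os
import shutil
import sysconfig


def diagnose_cli(command="great_expectations"):
    """Report whether `command` resolves on PATH and where --user scripts go."""
    # Assumption: the per-user install scheme is "posix_user" or "nt_user";
    # macOS framework builds of Python may use a different scheme.
    user_scripts = sysconfig.get_path("scripts", f"{os.name}_user")
    path_dirs = os.environ.get("PATH", "").split(os.pathsep)
    return {
        "resolved": shutil.which(command),  # None when nothing on PATH matches
        "user_scripts_dir": user_scripts,
        "user_scripts_on_path": user_scripts in path_dirs,
    }


for key, value in diagnose_cli().items():
    print(f"{key}: {value}")
```

If `resolved` is `None` while the executable exists in `user_scripts_dir`, adding that directory to `PATH` should make the command available. If the executable is absent from that directory as well, the installed version probably does not provide the CLI.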


In [3]:
!great_expectations datasource new


/bin/bash: line 1: great_expectations: command not found


In [4]:
!great_expectations suite new


/bin/bash: line 1: great_expectations: command not found


In [5]:
!great_expectations suite edit sample_suite


/bin/bash: line 1: great_expectations: command not found


In [6]:
!great_expectations checkpoint new my_checkpoint


/bin/bash: line 1: great_expectations: command not found


In [7]:
!great_expectations checkpoint run my_checkpoint


/bin/bash: line 1: great_expectations: command not found


In [8]:
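# In the expectation suite notebook that `great_expectations suite edit` generates,
# a `batch` object is expected to exist already. Run outside that notebook,
# these calls raise NameError.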
batch.expect_column_to_exist("id")
batch.expect_column_values_to_be_of_type("id", "int64")
batch.expect_column_values_to_not_be_null("id")


NameError: name 'batch' is not defined
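
The `NameError` arises because no `batch` object was ever created: the CLI steps that would have produced one failed. To show what the three checks assert without depending on a Great Expectations installation, here is a minimal plain-Python sketch. It is a stand-in, not the library's API. The function names mirror the expectations above, but the row format (a list of dicts), the result dictionaries, the sample data, and the use of Python's `int` in place of the `"int64"` dtype string are all assumptions made for illustration.

```python
def _result(expectation, column, unexpected_rows):
    """Build a small result record (hypothetical format, not GE's)."""
    return {
        "expectation": expectation,
        "column": column,
        "success": not unexpected_rows,
        "unexpected_rows": unexpected_rows,
    }


def expect_column_to_exist(rows, column):
    """Pass when every row carries the column as a key."""
    missing = [i for i, row in enumerate(rows) if column not in row]
    return _result("expect_column_to_exist", column, missing)


def expect_column_values_to_be_of_type(rows, column, expected_type):
    """Pass when every non-null value has exactly the expected type."""
    bad = [
        i for i, row in enumerate(rows)
        if row.get(column) is not None and type(row[column]) is not expected_type
    ]
    return _result("expect_column_values_to_be_of_type", column, bad)


def expect_column_values_to_not_be_null(rows, column):
    """Pass when no row has None (or a missing key) for the column."""
    bad = [i for i, row in enumerate(rows) if row.get(column) is None]
    return _result("expect_column_values_to_not_be_null", column, bad)


# Illustrative sample data; the third row has a null id.
sample_rows = [
    {"id": 1, "name": "Ada"},
    {"id": 2, "name": "Grace"},
    {"id": None, "name": "Linus"},
]

exists = expect_column_to_exist(sample_rows, "id")
typed = expect_column_values_to_be_of_type(sample_rows, "id", int)
not_null = expect_column_values_to_not_be_null(sample_rows, "id")

for outcome in (exists, typed, not_null):
    print(outcome)
```

The type check skips nulls on purpose, so that a missing value is reported once by the not-null check rather than twice.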

### Task 2: Validate Datasets and Generate Reports

1. Objective: Validate a dataset against defined expectations and generate a report.
2. Steps:
    - Execute the validation process on the dataset.
    - Review the validation results and generate a report.
    - E.g., validate completeness and consistency expectations, and view the results.


In [9]:
# Write your code from here
!great_expectations suite list


/bin/bash: line 1: great_expectations: command not found


In [10]:
!great_expectations suite edit sample_suite


/bin/bash: line 1: great_expectations: command not found


In [11]:
!great_expectations docs build


/bin/bash: line 1: great_expectations: command not found


In [12]:
!great_expectations/uncommitted/data_docs/local_site/index.html


/bin/bash: line 1: great_expectations/uncommitted/data_docs/local_site/index.html: No such file or directory
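
In a notebook, a leading `!` hands the rest of the line to the shell, so the cell above asked bash to execute `index.html` as a program instead of opening it. The file was also never built, because the `docs build` step failed. A small sketch that opens an HTML report in the default browser only when the file exists follows. `open_report` is a hypothetical helper name.

```python
import webbrowser
from pathlib import Path


def open_report(path):
    """Open an HTML report in the default browser; return False if it is missing."""
    report = Path(path)
    if not report.is_file():
        print(f"Report not found: {report}")
        return False
    # as_uri() requires an absolute path, hence resolve().
    return webbrowser.open(report.resolve().as_uri())
```

Calling `open_report("great_expectations/uncommitted/data_docs/local_site/index.html")` with the path from the cell above would report the missing file rather than fail with a shell error.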


In [13]:
batch.expect_column_values_to_not_be_null("email")
batch.expect_column_values_to_be_unique("user_id")
batch.expect_column_values_to_match_regex("email", r"[^@]+@[^@]+\.[^@]+")


NameError: name 'batch' is not defined
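
Since `batch` is undefined here as well, the following plain-Python sketch illustrates what these three checks assert and how their outcomes can be gathered into a simple report. It is not the Great Expectations API. The `validate_users` name, the report layout, and the sample rows are assumptions, and the regular expression is applied with `re.fullmatch`, which is a deliberate choice of this sketch rather than a statement about how the library matches.

```python
import json
import re
from collections import Counter

# Same pattern as in the cell above.
EMAIL_PATTERN = re.compile(r"[^@]+@[^@]+\.[^@]+")


def validate_users(rows):
    """Run completeness, uniqueness, and format checks; return a report dict."""
    emails = [row.get("email") for row in rows]
    user_ids = [row.get("user_id") for row in rows]
    id_counts = Counter(user_ids)

    # Each check maps to the indices of the rows that violate it.
    checks = {
        "email_not_null": [i for i, e in enumerate(emails) if e is None],
        "user_id_unique": [i for i, u in enumerate(user_ids) if id_counts[u] > 1],
        "email_matches_regex": [
            i for i, e in enumerate(emails)
            if e is not None and not EMAIL_PATTERN.fullmatch(e)
        ],
    }
    results = [
        {"check": name, "success": not bad, "unexpected_rows": bad}
        for name, bad in checks.items()
    ]
    return {
        "rows_evaluated": len(rows),
        "success": all(r["success"] for r in results),
        "results": results,
    }


# Illustrative data: one null email, one duplicated user_id, one malformed email.
users = [
    {"user_id": 1, "email": "a@example.com"},
    {"user_id": 2, "email": None},
    {"user_id": 2, "email": "not-an-email"},
]

report = validate_users(users)
print(json.dumps(report, indent=2))
```

Null emails are excluded from the format check so that each defect is attributed to exactly one check in the report.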

### Task 3: Advanced Expectations and Scheduling

1. Objective: Create advanced expectations for conditional checks and automate the validation.
2. Steps:
    - Define advanced expectations based on complex conditions.
    - Use scheduling tools to automate periodic checks.
    - E.g., define an expectation that customer IDs must be unique, and schedule a daily check.
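
A conditional check applies an expectation only to the rows that satisfy some condition. Great Expectations offers this through a `row_condition` argument on many expectations; the sketch below does not use that API and only illustrates the idea in plain Python. `expect_unique_when`, the `status` column, and the sample rows are hypothetical.

```python
from collections import Counter


def expect_unique_when(rows, column, condition):
    """Require `column` to be unique among the rows where `condition(row)` holds."""
    in_scope = [(i, row.get(column)) for i, row in enumerate(rows) if condition(row)]
    counts = Counter(value for _, value in in_scope)
    duplicates = [i for i, value in in_scope if counts[value] > 1]
    return {
        "evaluated_rows": len(in_scope),
        "success": not duplicates,
        "unexpected_rows": duplicates,
    }


# Illustrative data: customer_id 11 appears twice among active customers,
# while the repeat of 10 belongs to a closed account and is out of scope.
customers = [
    {"customer_id": 10, "status": "active"},
    {"customer_id": 10, "status": "closed"},
    {"customer_id": 11, "status": "active"},
    {"customer_id": 11, "status": "active"},
]

outcome = expect_unique_when(
    customers, "customer_id", lambda row: row["status"] == "active"
)
print(outcome)
```

Passing `lambda row: True` as the condition turns this into the unconditional uniqueness check named in the example above.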

In [14]:
# Write your code from here
!great_expectations suite edit sample_suite


/bin/bash: line 1: great_expectations: command not found


In [15]:
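from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Run the checkpoint once a day. catchup=False skips backfilling runs
# for the dates between start_date and today.
with DAG('daily_data_validation',
         start_date=datetime(2024, 1, 1),
         schedule_interval='@daily',  # newer Airflow releases use `schedule=` instead
         catchup=False) as dag:

    # BashOperator marks the task as failed if the command exits non-zero.
    validate_data = BashOperator(
        task_id='run_great_expectations',
        bash_command='great_expectations checkpoint run my_checkpoint'
    )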


ModuleNotFoundError: No module named 'airflow'
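
The import fails because Apache Airflow is not installed in this environment. Where a full scheduler is unavailable, the core of a daily schedule is small: compute the time until the next run, sleep, then run the command. The following standard-library sketch shows that loop. `seconds_until`, `run_daily`, and the choice of run hour are illustrative assumptions, and naive local time is used, so daylight-saving transitions are not handled.

```python
import shlex
import subprocess
import time
from datetime import datetime, timedelta


def seconds_until(hour, minute=0, now=None):
    """Seconds from `now` until the next occurrence of hour:minute (local time)."""
    now = now or datetime.now()
    target = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if target <= now:
        # Today's slot has passed (or is exactly now), so use tomorrow's.
        target += timedelta(days=1)
    return (target - now).total_seconds()


def run_daily(command, hour, max_runs=None):
    """Run `command` once per day at `hour`; stop after `max_runs` runs if given."""
    runs = 0
    while max_runs is None or runs < max_runs:
        time.sleep(seconds_until(hour))
        try:
            result = subprocess.run(shlex.split(command), check=False)
            status = f"exit code {result.returncode}"
        except FileNotFoundError:
            status = "command not found"
        print(f"{datetime.now():%Y-%m-%d %H:%M} {status}")
        runs += 1
```

`run_daily("great_expectations checkpoint run my_checkpoint", hour=2)` would block the calling process indefinitely, so a cron entry or a real scheduler such as Airflow remains the more robust option for production use.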