# Great Expectations tutorial
Welcome! In this tutorial we'll look at Great Expectations, a framework that helps you keep an eye on your data quality. It provides a batteries-included solution for testing and documenting your data, so that nobody runs into surprises when consuming it. To achieve this, you create _expectation suites_.

**You can think of them as unit tests, but for data.** 

They also double as documentation for your dataset, so that you won't have to repeat yourself.

What do we mean by data quality? Data can be of poor quality for different reasons. Usually, the problem is that its **structure** (for example the columns and their types in a table) or its **contents** (specific cells in a table) are not what you expected.
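
To make the distinction concrete, here is a minimal plain-Python sketch (not Great Expectations code). The rows and the helper names `check_structure` and `check_contents` are made up for illustration; only the column names are borrowed from the taxi dataset used later in this tutorial.

```python
# Hypothetical sample rows; the second one has a missing passenger count.
rows = [
    {"vendor_id": 1, "passenger_count": 2},
    {"vendor_id": 2, "passenger_count": None},
]

# Expected structure: column name -> expected Python type.
expected_columns = {"vendor_id": int, "passenger_count": int}


def check_structure(rows, expected_columns):
    """Structure: every row has exactly the expected columns."""
    return all(set(row) == set(expected_columns) for row in rows)


def check_contents(rows, expected_columns):
    """Contents: list (row index, column) for every missing or mistyped cell."""
    problems = []
    for index, row in enumerate(rows):
        for column, expected_type in expected_columns.items():
            if not isinstance(row.get(column), expected_type):
                problems.append((index, column))
    return problems


structure_ok = check_structure(rows, expected_columns)
content_problems = check_contents(rows, expected_columns)
print(structure_ok, content_problems)
```

Here the structure is fine, but the contents check flags one cell. Great Expectations lets you express both kinds of checks declaratively instead of hand-writing loops like these.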

For more background on Great Expectations and the problems it solves, we recommend the authors' blog post: [Down with Pipeline debt / Introducing Great Expectations](https://medium.com/@expectgreatdata/down-with-pipeline-debt-introducing-great-expectations-862ddc46782a). It's a good read!

## What is Great Expectations (GX) exactly?

<img src='figures/in_out.png' width=800px>

When working with GX you use the following five core components to access, store, and manage underlying objects and processes:
- **Data Context:**  Manages the settings and metadata for a GX project, and provides an entry point to the GX Python API.
- **Data Sources:**  Connects to your Data Source, and organizes retrieved data for future use.
- **Expectations:**  Identifies the standards to which your data should conform.
- **Checkpoints:** Validates a set of Expectations against a specific set of data.
- **Data Docs:**  Creates a web-based documentation site for your data.


## In this tutorial
We'll give you a brief introduction to the main concepts used in Great Expectations, walking you through writing your first expectations and generating your first data report. We have added many references to the official documentation that you can refer to when you are configuring your own setup.

Contents:
- [Data Context](#section-data-context)
- [Data Sources](#section-data-sources)
- [The Expectation Suite](#section-expectation-suite)
- [Checkpoints](#section-checkpoints)
- [Data Docs](#section-data-docs)
- [Actions](#section-actions)
- [Data Assistant](#section-data-assistant)

## Running on Google Colab

If you are running this on Google Colab, make sure to run the cell below to set everything up.

In [None]:
%%bash
if [[ ! -d gx ]]
then 
  git init
  git remote add origin https://github.com/Robin069/tutorial-great-expectations.git
  git pull origin main
  pip install -r requirements.txt
  apt-get install -y tree
  mkdir data
  python -c "import pandas as pd; pd.read_csv('https://raw.githubusercontent.com/great-expectations/gx_tutorials/main/data/yellow_tripdata_sample_2019-01.csv').to_csv('data/yellow_tripdata_sample_2019-01.csv', index=False); pd.read_csv('https://raw.githubusercontent.com/great-expectations/gx_tutorials/main/data/yellow_tripdata_sample_2019-02.csv').to_csv('data/yellow_tripdata_sample_2019-02.csv', index=False)"
fi

## Getting started

Let's jump into it then!

In [None]:
import great_expectations as gx
import os
import shutil
if os.path.exists("gx"):
    shutil.rmtree("gx")
print(gx.__version__)

<a id="section-data-context"></a>
## Data Context

In [None]:
context = gx.get_context(project_root_dir=".")

First, let's take a moment to look at the `DataContext`, which represents your Great Expectations setup. It consists of a directory holding configuration files, named `gx` by default.

Note: we are omitting the `uncommitted` directory here. It contains output files (such as rendered data docs), which are not part of the configuration.

In [None]:
!tree gx -nI 'uncommitted'

The main configuration is located in `great_expectations.yml`. We won't go into all the details here; you can refer to the [data context reference](https://docs.greatexpectations.io/docs/conceptual_guides/gx_overview#data-context) for that.

Instead, we'll just introduce some concepts you'll want to be familiar with:

- A **data source** provides a standard API for accessing and interacting with data from a wide variety of source systems. 

    Great Expectations ships with a number of built-in data sources, including:

    - Pandas
    - SQL
    - Spark
    - CSV
    - Excel
    - BigQuery
    - Snowflake
    - Redshift
    - Postgres
    - MySQL
    - ...

    You can also create your own custom data sources.


    **No matter which Data Source you use, the Data Source's API remains the same.**

- A **data asset** is one dataset that lives in a *data source*, such as an SQL table.
- **Stores** configure how expectation and validation data are stored. See [configuring metadata stores](https://docs.greatexpectations.io/en/latest/guides/how_to_guides/configuring_metadata_stores.html) if you're interested.

These are all configured in the `great_expectations.yml` file. We'll have a brief look at its contents now, but don't worry about the details: it is shown here for illustration purposes only.

<img src="figures/data_context_flowchart.png" width=1200px>

In [None]:
!cat gx/great_expectations.yml

<a id="section-data-sources"></a>
## Data Sources

<img src="figures/datasource_flowchart.png" width=1200px>

Next, we load our dataset, `yellow_tripdata_sample_2019-01.csv`, and create a `Validator` object:

In [None]:
validator = context.sources.pandas_default.read_csv(
    "data/yellow_tripdata_sample_2019-01.csv",
)

A `Validator` is the primary object used to configure and run validation of data assets. Validators store information about the data asset they are validating, including the expectations that have been set, and the results of those validations. 

One validation run can include multiple batches and expectation suites. This way, it is possible to test multiple files in the same run. Compare this to how one run of your test suite can test multiple software modules.

In [None]:
validator.head()

This is the documentation that came with the data:
 - **vendor_id** - A code indicating the TPEP provider that provided the record. 1= Creative Mobile Technologies, LLC; 2= VeriFone Inc.
 - **pickup_datetime** - The date and time when the meter was engaged.
 - **dropoff_datetime** - The date and time when the meter was disengaged.
 - **passenger_count** - The number of passengers in the vehicle. This is a driver-entered value.
 - **trip_distance** - The elapsed trip distance in miles reported by the taximeter.
 - **rate_code_id** - The final rate code in effect at the end of the trip. 1= Standard rate, 2=JFK, 3=Newark, 4=Nassau or Westchester, 5=Negotiated fare, 6=Group ride
 - **store_and_fwd_flag** - This flag indicates whether the trip record was held in vehicle memory before sending to the vendor, aka “store and forward,” because the vehicle did not have a connection to the server. Y= store and forward trip, N= not a store and forward trip
 - **pickup_location_id** - TLC Taxi Zone in which the taximeter was engaged
 - **dropoff_location_id** - TLC Taxi Zone in which the taximeter was disengaged
 - **payment_type** - A numeric code signifying how the passenger paid for the trip. 1= Credit card, 2= Cash, 3= No charge, 4= Dispute, 5= Unknown, 6= Voided trip
 - **fare_amount** - The time-and-distance fare calculated by the meter.
 - **extra** - Miscellaneous extras and surcharges. Currently, this only includes the \$0.50 and \$1 rush hour and overnight charges.
 - **mta_tax** - \$0.50 MTA tax that is automatically triggered based on the metered rate in use.
 - **tip_amount** - This field is automatically populated for credit card tips. Cash tips are not included.
 - **tolls_amount** - Total amount of all tolls paid in trip.
 - **improvement_surcharge** - \$0.30 improvement surcharge assessed trips at the flag drop. The improvement surcharge began being levied in 2015.
 - **total_amount** - The total amount charged to passengers. Does not include cash tips.
 - **congestion_surcharge** - \$2.50 surcharge for all trips that begin, end or pass through the Manhattan exclusionary zone.


<a id="section-expectation-suite"></a>
## The Expectation Suite

These descriptions sure help us to understand the dataset a bit better, but they don't exactly provide many guarantees. When consuming this dataset, what expectations can we have? Will the `passenger_count` field always be specified? Will the `dropoff_datetime` field always be in the same format? Is the total amount always positive?

Great Expectations helps us to codify these properties in a set of `Expectations`. An `Expectation` is something that you expect to be true in your data. Again, think of it as a unit test for your dataset.

**An Expectation is a verifiable assertion about data.** Expectations enhance communication about your data and improve quality for data applications. They help you take the implicit assumptions about your data and make them explicit.
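
Conceptually, an Expectation takes data and returns a result that says whether the assertion held. The plain-Python sketch below mimics a not-null check. The function name, the sample rows, and the result keys are simplified for illustration; this is not how GX implements it.

```python
def expect_values_to_not_be_null(rows, column):
    """Toy not-null check over a list of dict rows (illustrative, not GX code)."""
    unexpected = [i for i, row in enumerate(rows) if row.get(column) is None]
    return {
        "success": not unexpected,
        "element_count": len(rows),
        "unexpected_count": len(unexpected),
        "unexpected_index_list": unexpected,
    }


# Made-up rows for the example; the second one is missing its timestamp.
rows = [
    {"pickup_datetime": "2019-01-15 03:36:12"},
    {"pickup_datetime": None},
    {"pickup_datetime": "2019-01-25 18:20:32"},
]

result = expect_values_to_not_be_null(rows, "pickup_datetime")
print(result)
```

The important part is the shape of the idea: an assertion goes in, and a machine-readable verdict with supporting details comes out.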

There are many built-in Expectations; see https://greatexpectations.io/expectations/ for a full list.

In [None]:
validator.expect_column_values_to_not_be_null("pickup_datetime")
validator.expect_column_values_to_be_between("passenger_count", auto=True)


Let's save our Expectations to a file so we can reuse them later. Expectations are stored in the *expectation store*, which by default is the `expectations` folder inside your configuration, but you can use other storage backends as well, such as a SQL database or cloud storage (S3, Azure Blob Storage or GCS). 

In [None]:
validator.save_expectation_suite()
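
With the default file-based store, a saved suite ends up as a JSON document. The sketch below builds a simplified approximation of such a document using only the standard library. The layout and the suite name `default` are assumptions for illustration; real files carry more metadata and the exact keys depend on your GX version, so inspect the file in your own `gx/expectations` folder for the real thing.

```python
import json

# Simplified, assumed layout of a stored suite (real files contain more fields).
suite = {
    "expectation_suite_name": "default",
    "expectations": [
        {
            "expectation_type": "expect_column_values_to_not_be_null",
            "kwargs": {"column": "pickup_datetime"},
        },
    ],
}

# Round-trip through JSON, as a file-based store would do on save and load.
serialized = json.dumps(suite, indent=2)
restored = json.loads(serialized)
print(serialized)
```

Because a suite is plain data rather than code, it can be version-controlled, reviewed, and rendered as documentation.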

<a id="section-checkpoints"></a>
## Checkpoints

We can now run the Expectation Suite against our data. For this, we'll use a `Checkpoint`. A Checkpoint is a collection of Expectations and a Data Asset. It is a way to package up a test suite and apply it to a data asset.

In [None]:
checkpoint = context.add_or_update_checkpoint(
    name="checkpoint_1",
    validator=validator,
)
checkpoint_result = checkpoint.run()

In [None]:
checkpoint_result

Checkpoint runs are the primary way to use Great Expectations in automated workflows. They are designed to be used in automated data quality pipelines, and can be run from the command line, from notebooks, or from any other Python code. They produce a JSON-formatted validation result document that can be used for further analysis or as a report.
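
In an automated pipeline you typically branch on the outcome of a run. The sketch below works on a plain dictionary that stands in for a validation result. The keys shown are a simplified assumption and `to_exit_code` is a hypothetical helper, not part of GX.

```python
def to_exit_code(validation_result):
    """Map a result-like dict to (exit code, message) for use in a pipeline step."""
    failed = [
        item["expectation_type"]
        for item in validation_result["results"]
        if not item["success"]
    ]
    if validation_result["success"]:
        return 0, "all expectations met"
    return 1, "failed: " + ", ".join(failed)


# Stand-in for a real validation result (made-up content).
validation_result = {
    "success": False,
    "results": [
        {"expectation_type": "expect_column_values_to_not_be_null", "success": True},
        {"expectation_type": "expect_column_values_to_be_in_set", "success": False},
    ],
}

exit_code, message = to_exit_code(validation_result)
print(exit_code, message)
```

A non-zero exit code is a common way to make a scheduler or CI job stop when the data does not meet expectations.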

<img src="figures/checkpoint_flowchart.png" width=1200px>

Now that we've covered the basics, let's get to some fancier expectations. For example, we could make sure that all date columns are in the expected format:

In [None]:
validator.expect_column_values_to_match_strftime_format('pickup_datetime', "%Y-%m-%d %H:%M:%S")
validator.expect_column_values_to_match_strftime_format('dropoff_datetime', "%Y-%m-%d %H:%M:%S")
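
The format string uses the same directives as Python's `strftime` and `strptime`. As a rough illustration of what "matching the format" means, the standard-library sketch below simply tries to parse each value. The helper name `matches_strftime_format` and the sample values are made up, and GX's own check may differ in its details.

```python
from datetime import datetime


def matches_strftime_format(value, fmt):
    """Return True if `value` can be parsed with the format string `fmt`."""
    try:
        datetime.strptime(value, fmt)
    except ValueError:
        return False
    return True


fmt = "%Y-%m-%d %H:%M:%S"
samples = [
    "2019-01-15 03:36:12",  # well-formed
    "15/01/2019 03:36",     # different layout
    "2019-02-30 10:00:00",  # right layout, impossible date
]
checks = [matches_strftime_format(sample, fmt) for sample in samples]
print(checks)
```

Note that the last sample has the right layout but is still rejected, because February 30 is not a valid date.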

In [None]:
validator.save_expectation_suite()
checkpoint = context.add_or_update_checkpoint(
    name="checkpoint_2",
    validator=validator,
)

checkpoint_result = checkpoint.run()

All the expectations we've implemented so far have been met by our data. But what happens if we run into a problem? Let's try it out by adding a new expectation that checks if the `vendor_id` is always 1 or 2:

In [None]:
validator.expect_column_values_to_be_in_set('vendor_id', [1, 2])

That failed. Great Expectations helpfully collected the values which do not match the expectation for us. By default, it will collect up to 20 examples of values that didn't meet the expectation (that's why it's called the _partial_ unexpected list). 
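
The difference between the full count and the partial list can be sketched in plain Python. The function below is a toy stand-in, not GX code, and the values (including the stray vendor code `4`) are made up for the example.

```python
def expect_values_in_set(values, allowed, partial_unexpected_count=20):
    """Toy in-set check that caps the list of reported examples."""
    allowed = set(allowed)
    unexpected = [value for value in values if value not in allowed]
    return {
        "success": not unexpected,
        "unexpected_count": len(unexpected),
        # Only the first few offending values are kept as examples.
        "partial_unexpected_list": unexpected[:partial_unexpected_count],
    }


# 100 valid vendor codes plus 30 made-up invalid ones.
values = [1, 2] * 50 + [4] * 30
result = expect_values_in_set(values, [1, 2])
print(result["unexpected_count"], len(result["partial_unexpected_list"]))
```

All 30 offending values are counted, but only 20 are kept as examples, which keeps result documents small on large datasets.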

In the Data Docs, you can see that the expectation failed, and you can also see the unexpected values. This is very useful when you are trying to debug your data.

In [None]:
validator.save_expectation_suite(discard_failed_expectations=False)
checkpoint = context.add_or_update_checkpoint(
    name="checkpoint_3",
    validator=validator,
)

checkpoint_result = checkpoint.run()

<a id="section-data-docs"></a>
## Data Docs

We can render these results to a friendly report, called a `Data Doc`. These data docs will describe the expectations that the data should meet, as well as the metrics detailing how well the data meets the requirements. This is how Great Expectations combines testing with documenting. Data Docs are essentially static HTML pages that can be hosted on any web server or cloud storage provider.

We already built the Data Docs when we ran the Checkpoints in the previous section. Let's check them out now!

In [None]:
context.open_data_docs()

<a id="section-actions"></a>
## Actions

One of the most powerful features of Checkpoints is that you can configure them to run Actions. The Validation Results generated when a Checkpoint runs determine what Actions are performed. Typical use cases include sending email, Slack messages, or custom notifications. 

Actions can be used to do anything you are capable of programming in Python. Actions are a versatile tool for integrating Checkpoints in your pipeline's workflow.
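
The idea behind Actions can be sketched without GX: after validation, a list of callables receives the result and each decides what to do with it. Everything below (`run_actions`, `notify_on_failure`, `log_summary`, and the result dictionary) is hypothetical and only illustrates the pattern; see the GX documentation for the real Action classes and their configuration.

```python
def run_actions(validation_result, actions):
    """Call every action with the validation result and collect what they return."""
    return [action(validation_result) for action in actions]


def notify_on_failure(validation_result):
    """Produce an alert message only when validation failed."""
    if validation_result["success"]:
        return None
    return "ALERT: validation failed for " + validation_result["suite"]


def log_summary(validation_result):
    """Always produce a one-line summary."""
    status = "passed" if validation_result["success"] else "failed"
    return f"{validation_result['suite']}: {status}"


# Made-up result for the example.
outputs = run_actions(
    {"suite": "taxi_suite", "success": False},
    [notify_on_failure, log_summary],
)
print(outputs)
```

In a real setup the alert would go to email or Slack instead of being returned as a string.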

<img src="figures/actions.png" width=800px>

<a id="section-data-assistant"></a>
## Data Assistant

In the previous sections we validated our data with expectations that we defined ourselves. This requires some knowledge of the data, and some effort to write the expectations.

But what if you don't know what exactly to expect of your data? This is where the `Data Assistant` comes in.

A Data Assistant is a pre-configured utility that simplifies the creation of Expectations. A Data Assistant can help you determine a starting point when working with a large, new, or complex dataset by asking questions and then building a list of relevant Metrics from the answers to those questions.
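
As a rough mental model, a Data Assistant profiles the batches it is given and turns the observed Metrics into proposed Expectations. The plain-Python sketch below does this for a single metric pair (minimum and maximum). The function `propose_between_expectation` and the sample batches are made up, and the real assistant estimates its parameters in a more sophisticated way.

```python
def propose_between_expectation(batches, column):
    """Derive min/max bounds for `column` from all non-null values across batches."""
    values = [
        row[column]
        for batch in batches
        for row in batch
        if row[column] is not None
    ]
    return {
        "expectation_type": "expect_column_values_to_be_between",
        "kwargs": {
            "column": column,
            "min_value": min(values),
            "max_value": max(values),
        },
    }


# Two made-up batches, standing in for the January and February files.
january = [{"passenger_count": 1}, {"passenger_count": 4}]
february = [{"passenger_count": 2}, {"passenger_count": 6}, {"passenger_count": None}]

proposal = propose_between_expectation([january, february], "passenger_count")
print(proposal)
```

The proposal only describes what the data looked like. Whether that range is what the data should look like is still your call.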

In [None]:
if os.path.exists("gx"):
    shutil.rmtree("gx")
try:
    del context
except NameError:
    pass

In [None]:
context = gx.get_context(project_root_dir=".")

In [None]:
context.sources.add_pandas_filesystem(
    "taxi_multi_batch_datasource",
    base_directory="./data",  # replace with your data directory
).add_csv_asset(
    "all_years",
    batching_regex=r"yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv",
)

all_years_asset = context.datasources[
    "taxi_multi_batch_datasource"
].get_asset("all_years")

multi_batch_all_years_batch_request = (
    all_years_asset.build_batch_request()
)
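
The `batching_regex` above decides which files become batches, and its named groups become the identifiers of each batch. The standard-library sketch below shows how the pattern behaves on a few file names. The `README.md` entry is made up, and the use of `fullmatch` is an assumption for this illustration rather than a statement about how GX applies the pattern.

```python
import re

batching_regex = re.compile(
    r"yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv"
)

file_names = [
    "yellow_tripdata_sample_2019-01.csv",
    "yellow_tripdata_sample_2019-02.csv",
    "README.md",  # made-up extra file that should be ignored
]

# Map each matching file name to the values captured by the named groups.
batches = {}
for name in file_names:
    match = batching_regex.fullmatch(name)
    if match:
        batches[name] = match.groupdict()

print(batches)
```

Each matching file yields one batch, identified by its `year` and `month`, while files that do not match are left out.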

In [None]:
expectation_suite_name = "my_onboarding_assistant_suite"

expectation_suite = context.add_or_update_expectation_suite(
    expectation_suite_name=expectation_suite_name
)

In [None]:
data_assistant_result = context.assistants.onboarding.run(
    batch_request=multi_batch_all_years_batch_request,
)

In [None]:
expectation_suite = data_assistant_result.get_expectation_suite(
    expectation_suite_name=expectation_suite_name
)

context.add_or_update_expectation_suite(expectation_suite=expectation_suite)

In [None]:
checkpoint = context.add_or_update_checkpoint(
    name=f"yellow_tripdata_sample_{expectation_suite_name}",
    validations=[
        {
            "batch_request": multi_batch_all_years_batch_request,
            "expectation_suite_name": expectation_suite_name,
        }
    ]
)
checkpoint_result = checkpoint.run()

In [None]:
context.open_data_docs()

Run the following code to view Batch-level visualizations of the Metrics computed by the Onboarding Data Assistant:

In [None]:
data_assistant_result.plot_metrics()

In [None]:
data_assistant_result.plot_expectations_and_metrics()

This is a very powerful tool, and it can help you to get started with your data. However, it is not a silver bullet. It is still up to you to decide which metrics are important, and which are not. For example, the Data Assistant will not tell you that the `vendor_id` field should only contain the values 1 and 2. It will only tell you that the field is categorical, and that it contains 2 unique values. It is up to you to decide whether this is what you expect.

## Closing remarks

This concludes our tutorial. We hope you enjoyed it, and that you are now ready to start using Great Expectations in your own projects! 

**Any feedback is welcome!** If you have any questions, remarks, or suggestions, please let us know! :) 