# Great Expectations tutorial
Welcome! In this tutorial we'll have a look at Great Expectations, a tool written and configured in Python that aids you in keeping an eye on your data quality. It provides a batteries-included solution for testing and documenting your data, so that nobody has to run into any surprises when consuming it. To achieve this, you create _expectation suites_. You can think of them as unit tests, but for data. They also double as documentation for your dataset, so that you won't have to repeat yourself.

What do we mean by data quality? Well, bad quality data can happen for different reasons. Usually, data has bad quality if its structure (for example the columns and their types in a table) or its contents (specific cells in a table) are not what you expected.

For more background on Great Expectations and the problems it solves, we recommend the authors' blog post: [Down with Pipeline debt / Introducing Great Expectations](https://medium.com/@expectgreatdata/down-with-pipeline-debt-introducing-great-expectations-862ddc46782a). It's a good read!

## What is Great Expectations (GX) exactly?

<img src='figures/in_out.png' width=800px>

When working with GX you use the following four core components to access, store, and manage underlying objects and processes:
- **Data Context:**  Manages the settings and metadata for a GX project, and provides an entry point to the GX Python API.
- **Data Sources:**  Connects to your Data Source, and organizes retrieved data for future use.
- **Expectations:**  Identifies the standards to which your data should conform.
- **Checkpoints:** Validates a set of Expectations against a specific set of data.


## In this tutorial
We'll give you a brief introduction to the main concepts used in Great Expectations, walking you through writing your first expectations and generating your first data report. We have added many links to the official documentation that you can refer to when you are configuring your own setup.

Contents:
- [Data Context](#section-data-context)
- [The Expectation Suite](#section-expectation-suite)
- [Data Docs](#section-data-docs)
- [Checkpoints](#section-checkpoints)
- [Data Profiling](#section-profiling)
- [The Great Expectations CLI](#section-cli)

## Running on Google Colab

If you are running this on Google Colab, make sure to run the cell below to set everything up.

In [None]:
%%bash
if [[ ! -d gx ]]
then 
  git init
  git remote add origin https://github.com/datarootsio/tutorial-great-expectations.git
  git pull origin main
  pip install -r requirements.txt
  apt-get install tree
fi

## Getting started

Let's jump into it then!

In [16]:
import great_expectations as gx
print(gx.__version__)

0.18.1


<a id="section-data-context"></a>
## Data Context

In [17]:
context = gx.get_context(project_root_dir=".")

Before we move on, let's take a moment to look at the `DataContext`, which represents your Great Expectations setup. It consists of a directory holding configuration files, named `gx` by default.

Note: we are omitting the `uncommitted` directory here. It contains output files (such as rendered data docs), which are not part of the configuration.

In [36]:
!tree gx -nI 'uncommitted'


The main configuration is located in `great_expectations.yml`. We won't go into all the details here; you can refer to the [data context reference](https://docs.greatexpectations.io/en/latest/reference/spare_parts/data_context_reference.html) for that. Instead, we'll just introduce some concepts you'll want to be familiar with:

- A **data source** is something that can provide data to Great Expectations, such as an SQL database.
- A **data asset** is one dataset that lives in a *data source*, such as an SQL table.
- **stores** can be used to configure how expectation and validation data will be stored. See [configuring metadata stores](https://docs.greatexpectations.io/en/latest/guides/how_to_guides/configuring_metadata_stores.html) if you're interested.

These are all configured in the `great_expectations.yml` file. We'll have a brief look at its contents now, but don't worry about the details; it is shown for illustration purposes only.

<img src="figures/data_context_flowchart.png" width=1500px>

In [37]:
!cat gx/great_expectations.yml


Next, we load our dataset, `yellow_tripdata_sample_2019-01.csv`, and create a `Validator` object:

In [18]:
validator = context.sources.pandas_default.read_csv(
    "https://raw.githubusercontent.com/great-expectations/gx_tutorials/main/data/yellow_tripdata_sample_2019-01.csv"
)

A `Data Source` provides a standard API for accessing and interacting with data from a wide variety of source systems. 

Great Expectations ships with a number of built-in data sources, including:

- Pandas
- SQL
- Spark
- CSV
- Excel
- BigQuery
- Snowflake
- Redshift
- Postgres
- MySQL
- ...

You can also create your own custom data sources.


**No matter which Data Source you use, the Data Source's API remains the same.**

Let's look at our data:

In [19]:
validator.head()

| | vendor_id | pickup_datetime | dropoff_datetime | passenger_count | trip_distance | rate_code_id | store_and_fwd_flag | pickup_location_id | dropoff_location_id | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount | congestion_surcharge |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2019-01-15 03:36:12 | 2019-01-15 03:42:19 | 1 | 1.0 | 1 | N | 230 | 48 | 1 | 6.5 | 0.5 | 0.5 | 1.95 | 0.0 | 0.3 | 9.75 | |
| 1 | 1 | 2019-01-25 18:20:32 | 2019-01-25 18:26:55 | 1 | 0.8 | 1 | N | 112 | 112 | 1 | 6.0 | 1.0 | 0.5 | 1.55 | 0.0 | 0.3 | 9.35 | 0.0 |
| 2 | 1 | 2019-01-05 06:47:31 | 2019-01-05 06:52:19 | 1 | 1.1 | 1 | N | 107 | 4 | 2 | 6.0 | 0.0 | 0.5 | 0.0 | 0.0 | 0.3 | 6.8 | |
| 3 | 1 | 2019-01-09 15:08:02 | 2019-01-09 15:20:17 | 1 | 2.5 | 1 | N | 143 | 158 | 1 | 11.0 | 0.0 | 0.5 | 3.0 | 0.0 | 0.3 | 14.8 | |
| 4 | 1 | 2019-01-25 18:49:51 | 2019-01-25 18:56:44 | 1 | 0.8 | 1 | N | 246 | 90 | 1 | 6.5 | 1.0 | 0.5 | 1.65 | 0.0 | 0.3 | 9.95 | 0.0 |


This is the documentation that came with the data:
 - vendor_id - A code indicating the TPEP provider that provided the record. 1= Creative Mobile Technologies, LLC; 2= VeriFone Inc.
 - pickup_datetime - The date and time when the meter was engaged.
 - dropoff_datetime - The date and time when the meter was disengaged.
 - passenger_count - The number of passengers in the vehicle. This is a driver-entered value.
 - trip_distance - The elapsed trip distance in miles reported by the taximeter.
 - rate_code_id - The final rate code in effect at the end of the trip. 1= Standard rate, 2=JFK, 3=Newark, 4=Nassau or Westchester, 5=Negotiated fare, 6=Group ride
 - store_and_fwd_flag - This flag indicates whether the trip record was held in vehicle memory before sending to the vendor, aka “store and forward,” because the vehicle did not have a connection to the server. Y= store and forward trip, N= not a store and forward trip
 - pickup_location_id - TLC Taxi Zone in which the taximeter was engaged
 - dropoff_location_id - TLC Taxi Zone in which the taximeter was disengaged
 - payment_type - A numeric code signifying how the passenger paid for the trip. 1= Credit card, 2= Cash, 3= No charge, 4= Dispute, 5= Unknown, 6= Voided trip
 - fare_amount - The time-and-distance fare calculated by the meter.
 - extra - Miscellaneous extras and surcharges. Currently, this only includes the $0.50 and $1 rush hour and overnight charges.
 - mta_tax - $0.50 MTA tax that is automatically triggered based on the metered rate in use.
 - tip_amount - This field is automatically populated for credit card tips. Cash tips are not included.
 - tolls_amount - Total amount of all tolls paid in trip.
 - improvement_surcharge - $0.30 improvement surcharge assessed trips at the flag drop. The improvement surcharge began being levied in 2015.
 - total_amount - The total amount charged to passengers. Does not include cash tips.
 - congestion_surcharge - $2.50 surcharge for all trips that begin, end or pass through the Manhattan exclusionary zone.
 
These descriptions certainly help us understand the dataset a bit better, but they don't provide many guarantees. When consuming this dataset, what expectations can we have? Will the `passenger_count` field always be specified? Will the `dropoff_datetime` field always be in the same format? Is the total amount always positive?

Great Expectations helps us to codify these properties in a set of `Expectations`. An `Expectation` is something that you expect to be true in your data. Again, think of it as a unit test for your dataset.

Now that we have our `DataContext` ready, we can add `Expectations`. An Expectation is a verifiable assertion about data. Expectations enhance communication about your data and improve quality for data applications. They help you take the implicit assumptions about your data and make them explicit.
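To make the "unit test for data" idea concrete, here is a minimal plain-Python sketch of what a not-null check boils down to. It does not use Great Expectations at all: the helper `expect_values_to_not_be_null` and the sample column are hypothetical, and the returned keys only loosely imitate the result documents shown later in this tutorial.

```python
from typing import Any, Dict, List, Optional


def expect_values_to_not_be_null(values: List[Optional[Any]]) -> Dict[str, Any]:
    """Hypothetical helper: check that no value is None and report the outcome."""
    unexpected = [v for v in values if v is None]
    return {
        "success": not unexpected,           # True only if nothing was missing
        "element_count": len(values),        # how many values were checked
        "unexpected_count": len(unexpected), # how many violated the assertion
    }


# A tiny made-up column; the last entry is missing on purpose.
pickup_datetime = ["2019-01-15 03:36:12", "2019-01-25 18:20:32", None]

result = expect_values_to_not_be_null(pickup_datetime)
print(result)
```

Like a unit test, the check is a plain assertion with a pass/fail outcome; unlike a unit test, it also reports how much of the data violated it.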


There are many built-in Expectations, see https://greatexpectations.io/expectations/ for a full list. 

In the cell below, the first Expectation encodes domain knowledge: `pickup_datetime` should never be null.

The second Expectation passes `auto=True` so that Great Expectations detects the range of values in the `passenger_count` column itself. This combines Data Profiling with Data Testing; more on Data Profiling later.

In [20]:
validator.expect_column_values_to_not_be_null("pickup_datetime")
validator.expect_column_values_to_be_between("passenger_count", auto=True)



{
  "success": true,
  "expectation_config": {
    "expectation_type": "expect_column_values_to_be_between",
    "kwargs": {
      "min_value": 1,
      "max_value": 6,
      "strict_max": false,
      "mostly": 1.0,
      "strict_min": false,
      "column": "passenger_count"
    },
    "meta": {
      "auto_generated_at": "20231109T140437.149996Z",
      "great_expectations_version": "0.18.1"
    }
  },
  "result": {
    "element_count": 10000,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}
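The `kwargs` in the output above show the range that `auto=True` settled on (`min_value` 1, `max_value` 6). As a rough illustration of the idea only, the sketch below derives bounds from observed values and then checks new data against them. It assumes the simplest possible estimator (observed minimum and maximum); the estimator Great Expectations actually uses may well differ, and both helper functions and the sample values are hypothetical.

```python
from typing import Dict, List, Union

Number = Union[int, float]


def infer_between_kwargs(column: str, values: List[Number]) -> Dict[str, object]:
    """Hypothetical helper: derive inclusive bounds from the observed values."""
    return {
        "column": column,
        "min_value": min(values),
        "max_value": max(values),
        "strict_min": False,  # bounds are inclusive
        "strict_max": False,
    }


def values_between(values: List[Number], min_value: Number, max_value: Number) -> bool:
    """Return True if every value lies within the inclusive range."""
    return all(min_value <= v <= max_value for v in values)


observed = [1, 1, 2, 6, 3, 1, 5]  # made-up passenger counts
kwargs = infer_between_kwargs("passenger_count", observed)
print(kwargs)

new_batch = [1, 2, 7]  # 7 falls outside the inferred range
print(values_between(new_batch, kwargs["min_value"], kwargs["max_value"]))
```

Bounds inferred this way describe the data you happened to see, so it is worth reviewing them against domain knowledge before relying on them.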

Let's save our Expectations to a file so we can reuse them later.

In [21]:
validator.save_expectation_suite()

We can now run these Expectations against our data. For this, we'll use a `Checkpoint`. A Checkpoint bundles a set of Expectations with the data they should be validated against. It is a way to package up a test suite and apply it to a data asset.

In [22]:
checkpoint = context.add_or_update_checkpoint(
    name="checkpoint_1",
    validator=validator,
)
checkpoint_result = checkpoint.run()

In [24]:
checkpoint_result

{
  "run_id": {
    "run_name": null,
    "run_time": "2023-11-09T15:04:45.485392+01:00"
  },
  "run_results": {
    "ValidationResultIdentifier::default/__none__/20231109T140445.485392Z/default_pandas_datasource-#ephemeral_pandas_asset": {
      "validation_result": {
        "success": true,
        "results": [
          {
            "success": true,
            "expectation_config": {
              "expectation_type": "expect_column_values_to_not_be_null",
              "kwargs": {
                "column": "pickup_datetime",
                "batch_id": "default_pandas_datasource-#ephemeral_pandas_asset"
              },
              "meta": {}
            },
            "result": {
              "element_count": 10000,
              "unexpected_count": 0,
              "unexpected_percent": 0.0,
              "partial_unexpected_list": [],
              "partial_unexpected_counts": [],
              "partial_unexpected_index_list": []
            },
            "meta": {},
     

Checkpoint runs are the primary way to use Great Expectations in automated workflows. They are designed to be used in automated data quality pipelines, and can be run from the command line, from notebooks, or from any other Python code. They produce a JSON-formatted validation result document that can be used for further analysis or as a report. Data Docs are also updated with the results of Checkpoint runs and can be used to monitor data quality over time. Data Docs are saved as static HTML files and can be easily shared with others and viewed in any web browser.

In [25]:
context.open_data_docs()
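Besides browsing the Data Docs, the validation result document can be inspected programmatically. The sketch below walks a plain dictionary shaped like the (truncated) output shown earlier and lists the expectations that failed. The dictionary is a hand-written stand-in with a made-up failing entry, and `failed_expectations` is a hypothetical helper, not part of Great Expectations.

```python
from typing import Any, Dict, List

# Hand-written stand-in that mirrors the nesting of the output above.
checkpoint_result_dict: Dict[str, Any] = {
    "run_results": {
        "ValidationResultIdentifier::example": {
            "validation_result": {
                "success": False,
                "results": [
                    {
                        "success": True,
                        "expectation_config": {
                            "expectation_type": "expect_column_values_to_not_be_null",
                            "kwargs": {"column": "pickup_datetime"},
                        },
                    },
                    {
                        "success": False,  # made-up failure for illustration
                        "expectation_config": {
                            "expectation_type": "expect_column_values_to_be_in_set",
                            "kwargs": {"column": "vendor_id"},
                        },
                    },
                ],
            }
        }
    }
}


def failed_expectations(result: Dict[str, Any]) -> List[str]:
    """Hypothetical helper: collect 'expectation_type(column)' for each failure."""
    failures = []
    for run in result["run_results"].values():
        for item in run["validation_result"]["results"]:
            if not item["success"]:
                config = item["expectation_config"]
                column = config["kwargs"].get("column", "<table>")
                failures.append(f'{config["expectation_type"]}({column})')
    return failures


print(failed_expectations(checkpoint_result_dict))
```

A summary like this could feed an alert or a pipeline gate, while the Data Docs remain the place for human-readable detail.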

Now that we've covered the basics, let's get to some fancier expectations. For example, we could make sure that all date columns are in the expected format:

In [26]:
validator.expect_column_values_to_match_strftime_format('pickup_datetime', "%Y-%m-%d %H:%M:%S")
validator.expect_column_values_to_match_strftime_format('dropoff_datetime', "%Y-%m-%d %H:%M:%S")

{
  "success": true,
  "result": {
    "element_count": 10000,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}
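For intuition about what a format check like this involves, here is a small standard-library sketch that tests strings against the same format string using `datetime.strptime`. It is only an illustration of the idea, not how Great Expectations implements the expectation; `matches_strftime_format` is a hypothetical helper, and the last sample value is deliberately malformed.

```python
from datetime import datetime


def matches_strftime_format(value: str, fmt: str) -> bool:
    """Hypothetical helper: True if `value` parses cleanly with `fmt`."""
    try:
        datetime.strptime(value, fmt)
        return True
    except ValueError:
        return False


FORMAT = "%Y-%m-%d %H:%M:%S"

# Two values taken from the data preview above, plus one made-up bad value.
samples = ["2019-01-15 03:36:12", "2019-01-25 18:20:32", "15/01/2019 03:36"]

unexpected = [s for s in samples if not matches_strftime_format(s, FORMAT)]
print(unexpected)
```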

In [27]:
validator.save_expectation_suite()
checkpoint = context.add_or_update_checkpoint(
    name="checkpoint_2",
    validator=validator,
)

checkpoint_result = checkpoint.run()

All the expectations we've implemented so far have been met by our data. But what happens if we run into a problem? Let's try it out by adding a new expectation that checks if the `vendor_id` is always 1 or 2:

In [29]:
validator.expect_column_values_to_be_in_set('vendor_id', [1, 2])

{
  "success": false,
  "result": {
    "element_count": 10000,
    "unexpected_count": 96,
    "unexpected_percent": 0.96,
    "partial_unexpected_list": [
      4,
      4,
      4,
      4,
      4,
      4,
      4,
      4,
      4,
      4,
      4,
      4,
      4,
      4,
      4,
      4,
      4,
      4,
      4,
      4
    ],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.96,
    "unexpected_percent_nonmissing": 0.96
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

That failed. Great Expectations helpfully collected the values which do not match the expectation for us. By default, it will collect up to 20 examples of values that didn't meet the expectation (that's why it's called the _partial_ unexpected list). 

In the Data Docs, you can see that the expectation failed, and you can also see the unexpected values. This is very useful when you are trying to debug your data.
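To see how the fields of the failed result relate to one another, the sketch below recomputes them in plain Python for a made-up column with the same proportions as our data: 10,000 values, 96 of which are `4`. The helper `expect_values_in_set` is hypothetical, and the cap of 20 examples is taken from the default described above.

```python
from typing import Any, Dict, Iterable, List

PARTIAL_LIST_LIMIT = 20  # number of example values kept, as described above


def expect_values_in_set(values: List[Any], allowed: Iterable[Any]) -> Dict[str, Any]:
    """Hypothetical helper: report which values fall outside the allowed set."""
    allowed_set = set(allowed)
    unexpected = [v for v in values if v not in allowed_set]
    return {
        "success": not unexpected,
        "element_count": len(values),
        "unexpected_count": len(unexpected),
        "unexpected_percent": 100 * len(unexpected) / len(values) if values else 0.0,
        # Only a sample of the offending values is kept, hence "partial".
        "partial_unexpected_list": unexpected[:PARTIAL_LIST_LIMIT],
    }


# Made-up column: 96 of the 10,000 vendor ids are the undocumented value 4.
vendor_id = [1] * 5000 + [2] * 4904 + [4] * 96

result = expect_values_in_set(vendor_id, [1, 2])
print(result["unexpected_count"], result["unexpected_percent"])
print(len(result["partial_unexpected_list"]))
```

Note that `unexpected_percent` is a percentage, so 96 unexpected values out of 10,000 give 0.96 rather than 0.0096.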

In [15]:
validator.save_expectation_suite(discard_failed_expectations=False)
checkpoint = context.add_or_update_checkpoint(
    name="checkpoint_3",
    validator=validator,
)

checkpoint_result = checkpoint.run()

You can check out the [glossary of expectations](https://docs.greatexpectations.io/en/latest/reference/glossary_of_expectations.html) for a complete list of what you can do.

<a id="section-expectation-suite"></a>
## The Expectation Suite

So, while we were experimenting up there, great_expectations remembered all the expectations we ran. We can now retrieve the suite contents as follows:

In [None]:
batch.get_expectation_suite()

That gave us the `dict` representation Great Expectations uses under the hood to keep track of our expectation suite. Can you recognise some of the expectations we wrote?

An expectation suite is just a sequence of expectations, as shown below.
<img src="figures/expectation_suite.png">
This representation can then be saved to a file, so that we can load it again at another time, without depending on the python code that produced it.

Note that by default, expectations that failed on the `batch` we ran them against will be omitted. If you want to include them anyway, you can add the `discard_failed_expectations=False` parameter.

In [None]:
batch.save_expectation_suite()

What did that command do? Let's open up our configuration folder to try and find our expectation suite.

In [None]:
!tree great_expectations -nI "uncommitted"

The configuration folder itself is described in the [Data Context](#section-data-context) section, so don't worry about the other files for now.

As you can see, the `save_expectation_suite` command saved our `check_avocado_data` suite to the `expectations` folder. That's all there is to it: the expectation suite is just a JSON file. It contains the same internal representation that we retrieved from `get_expectation_suite()`. You can check it out if you like.

In [None]:
!cat great_expectations/expectations/check_avocado_data.json
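Because a suite is just a sequence of expectations serialized as JSON, it can be written and read back with nothing but the standard library. The sketch below round-trips a small hand-written suite through a temporary file. The suite name and the exact top-level keys (`expectation_suite_name`, `expectations`) are assumptions made for illustration; a real suite file written by Great Expectations may contain additional fields.

```python
import json
import tempfile
from pathlib import Path

# Hand-written stand-in for a suite: a name plus a list of expectations.
suite = {
    "expectation_suite_name": "example_suite",  # hypothetical name
    "expectations": [
        {
            "expectation_type": "expect_column_values_to_not_be_null",
            "kwargs": {"column": "pickup_datetime"},
            "meta": {},
        },
        {
            "expectation_type": "expect_column_values_to_be_in_set",
            "kwargs": {"column": "vendor_id", "value_set": [1, 2]},
            "meta": {},
        },
    ],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    suite_path = Path(tmp_dir) / "example_suite.json"
    suite_path.write_text(json.dumps(suite, indent=2))  # save to disk
    loaded = json.loads(suite_path.read_text())         # load it again

print([e["expectation_type"] for e in loaded["expectations"]])
```

Since the file carries no Python code, the suite can be versioned, reviewed, and loaded independently of the notebook that produced it.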

Expectations are stored in the *expectation store*, which by default is the `expectations` folder inside your configuration, but you can use other storage backends as well, such as a SQL database or cloud storage (S3, Azure Blob Storage or GCS). See [metadata stores](https://docs.greatexpectations.io/en/latest/guides/how_to_guides/configuring_metadata_stores.html) for more information.

<a id="validation-results"></a>
Now that we added our expectation suite to our `DataContext`, we can try running the entire suite.
Validating your data against an expectation suite is done by running a **validation operator**. A validation operator describes what should be done with your validation results. Here, we would like to store the results on disk and generate a friendly report on them. We'll show you how this is configured in the [Data Context section](#section-data-context); in the meantime we'll use `my_validation_operator`, which we shipped with the configuration.


In [None]:
results = context.run_validation_operator('my_validation_operator', assets_to_validate=[batch])

One validation run can include multiple batches and expectation suites. This way, it is possible to test multiple files in the same run. Compare this to how one run of your test suite can test multiple software modules.

We didn't explicitly specify the expectation suite to use with our data batch, because `batch` keeps track of the expectation suite for us. We already saw this when we retrieved the suite from it at the beginning of this section.

Now that we got through that, let's have a look at the results.

In [None]:
results

This is called a *validation result*. Validation results are kept in the *validation store*, which is the `great_expectations/uncommitted/validations` directory by default.

In [None]:
!tree -n great_expectations/uncommitted/validations

Great Expectations also allows you to set other backends as a validation store, such as your favourite cloud storage offering, or a SQL database. Check out [metadata stores](https://docs.greatexpectations.io/en/latest/guides/how_to_guides/configuring_metadata_stores.html) if you would like to learn more!

<a id="section-data-docs"></a>
## Data Docs

We can render these results to a friendly report, called a data doc. These data docs will describe the expectations that the data should meet, as well as the metrics detailing how well the data meets the requirements. This is how Great Expectations combines testing with documenting.

Remember that we already built the data docs using `my_validation_operator` in the previous section. Let's check them out now! We'll take you to the index page, make sure to browse around for a bit. In the `Validation Results` tab you'll find the validation run we ran above. Click it for a friendly report on its results. In the `Expectation Suites` tab, you can find a document detailing the expectations set by our `check_avocado_data` suite. You'll see the expectations we ran above reflected in the different sections.

If you are running the tutorial on your OS, run this command to open the data docs:

In [None]:
context.open_data_docs()

If you're running on docker, try this link [here](/view/great_expectations/uncommitted/data_docs/local_site/index.html). If the links don't work in your browser, you could try using the [jupyter file browser](/tree/great_expectations/uncommitted/data_docs/local_site). It's not ideal, but it works. 

Otherwise, you can view the results for our run [here](https://datarootsio.github.io/tutorial-great-expectations/validation).

Just like for validation results, different storage backends can be configured for your data docs. You could, for example, host them on cloud storage for easy viewing. Refer to [configuring data docs](https://docs.greatexpectations.io/en/latest/guides/how_to_guides/configuring_data_docs.html) for more information.

<a id="section-checkpoints"></a>
## Checkpoints

Remember how we launched a validation run back in the [Expectation Suite section](#section-expectation-suite)? There, we wrote code to run the validation on the data batch and expectation suite that we defined earlier on. If we bundle all these run parameters in a single configuration file, we can easily rerun the validation, for example each time our data changes. Such a configuration file is called a `Checkpoint` in Great Expectations.

As a quick reminder, for running a validation we need:
- A *validation operator* to handle the validation results
- A list of *batches*, each consisting of
    - A batch of data to check
    - expectation suites to check against
    
To create a checkpoint, we simply create a file in the `checkpoints` directory of our great_expectations configuration. We'll create the file manually now for demonstration purposes, but when doing this in your own project you probably want to use the CLI [[The Great Expectations CLI]](#section-cli), which will help you along the way.

In [None]:
%%writefile great_expectations/checkpoints/avocado_data.yml

validation_operator_name: my_validation_operator
batches:
  - batch_kwargs:
      path: data/avocado.csv
      datasource: data_dir
      data_asset_name: avocado
      reader_method: read_csv
      reader_options:
        index_col: 0
    expectation_suite_names:
      - check_avocado_data

The `batch_kwargs` property specifies how the data asset should be loaded. You might recognise the parameters from when we first loaded the `avocado.csv` file.

This might also be a good time to point out that our data batch will get read by pandas under the hood (we configured that in the `data_dir` data source). In `batch_kwargs`, we specify that we'd like to use the pandas `read_csv` method, which will receive the `reader_options` dict as additional parameters.
For more information on batches, check out the [creating batches](https://docs.greatexpectations.io/en/latest/guides/how_to_guides/creating_batches.html) guide.

The checkpoint can be executed by using the great_expectations cli:

In [None]:
!great_expectations checkpoint run avocado_data

So, to summarize: a checkpoint is a _runnable check_ for your data. They are your first stop for integrating Great Expectations into your pipelines and workflows.
For more info on how to do that, refer to the [validation guides](https://docs.greatexpectations.io/en/latest/guides/how_to_guides/validation.html), or the [workflows and patterns](https://docs.greatexpectations.io/en/latest/guides/workflows_patterns.html) guides.

Checkpoints and batches are represented visually below.

<img src="figures/checkpoint.png" width=600px>
<img src="figures/batch.png" width=600px>

<a id="section-profiling"></a>
## Profiling

In the previous sections we explored how we could get some metrics about our data using expectations. But what if you don't know what exactly to expect of your data? Well, you could try using Great Expectations' profiling feature, which can try to extract some useful metrics from your data. To try profiling our preconfigured `data_dir` data source, we can use the CLI:

In [None]:
!great_expectations datasource profile data_dir -y

If you are running on your own OS, running that command should have opened the freshly built data docs in your browser. If not, you can view the [results from our run](https://datarootsio.github.io/tutorial-great-expectations/profiling). You can find the results in the `Profiling Results` tab. The profiler also generated an expectation suite based on its observations, which you can find in the `Expectation Suites` tab. Be mindful that this is an experimental feature and the generated suite is usually not that helpful, but it could be a good starting point for writing your own.


If you'd like to know more about profiling, the [profiling reference](https://docs.greatexpectations.io/en/latest/reference/spare_parts/profiling_reference.html) can help you out.

<a id="section-cli"></a>
## The Great Expectations CLI

For the purposes of this tutorial, we mostly interacted directly with Great Expectations. If you are going to set up and use Great Expectations for yourself, we recommend using the CLI as much as possible. The concepts should be familiar by now; refer to the [CLI guide](https://docs.greatexpectations.io/en/latest/guides/how_to_guides/miscellaneous/command_line.html) for more.

In [None]:
!great_expectations --help

<a id="section-setup"></a>
## Setting up your own project

To initialize your own project, run `great_expectations init` and follow the instructions. This will scaffold a simple configuration for you, just like the one we provided.

Once you have created your suite using `great_expectations suite new`, you can use the `great_expectations suite edit` command to open up an auto-generated notebook that you can use to set up your suite. You should be able to recognise the structure of the first part of this notebook a bit ;-)

The [getting started guide](https://docs.greatexpectations.io/en/latest/guides/tutorials/getting_started.html) can help you along the way. For ideas on how Great Expectations can fit into your workflow, check out [Deployment patterns](https://docs.greatexpectations.io/en/latest/reference/core_concepts/validation.html#deployment-patterns).

<a id="section-conclusion"></a>
## Final words

Just to recap, in this tutorial notebook, we started by giving you an overview of the tool and its purpose. We then showed you how to get started with the Python library and define your expectations. We saw that expectations can be bundled as suites, which can be used with validation operators to produce validation results. We had a look at data docs, a clean way to visualize your results and data documentation. We then dived into the data context, showing how the tool is configured. We had a look at checkpoints, which allow you to automate your data testing. We talked a bit about profiling, an experimental feature to generate expectations from given data. Finally, we introduced you to the CLI and set you on the right path to start using Great Expectations right away!

We hope you enjoyed the tutorial and wish you all the best in using Great Expectations with your projects!