In [1]:
#| code-fold: true

import datetime as dt
from pathlib import Path
from urllib.request import urlretrieve

import geopandas as gpd
import pandas as pd

Great Expectations is an open-source Python library that brings the idea of "testing" to your data. It lets you define expectations for properties of your datasets (such as the number of records per batch, the distribution of values in a column, or the set of columns in a table) and check that the data still meets those expectations whenever it is updated.
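
To make the idea of an "expectation" concrete, here's a minimal plain-Python sketch. It is not the `great_expectations` API; the function only borrows the name of one of the library's expectations, and the two-row batch and the keys of the returned dict are made up for illustration.

```python
from typing import Any


def expect_column_values_to_not_be_null(
    rows: list[dict[str, Any]], column: str
) -> dict[str, Any]:
    """Toy stand-in for an expectation: check that `column` is never None."""
    unexpected = [i for i, row in enumerate(rows) if row.get(column) is None]
    return {
        "success": not unexpected,
        "unexpected_count": len(unexpected),
        "unexpected_index_list": unexpected,
    }


# Hypothetical two-row batch; the second row is missing its `results` value.
batch = [
    {"inspection_id": 1, "results": "PASS"},
    {"inspection_id": 2, "results": None},
]

id_check = expect_column_values_to_not_be_null(batch, "inspection_id")
results_check = expect_column_values_to_not_be_null(batch, "results")
print(id_check["success"], results_check["success"])
```

The point is only that an expectation is a named, reusable assertion about data that reports what it found instead of raising on the first problem.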

# Workflow Overview
The high-level `great_expectations` workflow follows this pattern:

0. Install `great_expectations`.
1. Create (or load) a [`DataContext`](https://docs.greatexpectations.io/docs/guides/setup/configuring_data_contexts/instantiating_data_contexts/instantiate_data_context) for your project.
2. Connect [`Datasources`](https://docs.greatexpectations.io/docs/guides/connecting_to_your_data/connect_to_data_lp/) (and their `DataAssets`) to your `DataContext`.
3. Define [`Expectations`](https://docs.greatexpectations.io/docs/guides/expectations/expectations_lp) for your `DataAssets`.
4. Set [`Checkpoints`](https://docs.greatexpectations.io/docs/guides/validation/validate_data_lp) to validate your `DataAssets`.

In this post, I'll demonstrate this workflow with local-file-based `DataAssets` in a local filesystem `DataContext` (although `great_expectations` also works with SQL-based and cloud-bucket-based `DataAssets`).

## 0. Great Expectations Setup

First, you'll need to install `great_expectations`. If you already have `conda` installed on your machine, you can set up a conda env just like the one used to run this notebook by:
1. copying the `gx_env_environment.yml` file (in the same dir as this notebook file) to your machine,
2. opening a terminal and navigating to the dir with that new file, and
3. running the command `conda env create -f gx_env_environment.yml`.
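
Once the env is activated, a quick way to confirm that the packages this notebook imports are available is to ask Python whether it can locate them. This is a small sketch using only the standard library; `is_installed` is a helper name I made up for this post.

```python
from importlib.util import find_spec


def is_installed(package_name: str) -> bool:
    """Return True if `package_name` can be located in the active environment."""
    return find_spec(package_name) is not None


# Packages imported in this notebook, including great_expectations itself.
for name in ["great_expectations", "geopandas", "pandas"]:
    status = "found" if is_installed(name) else "MISSING"
    print(f"{name}: {status}")
```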

## Sample Data Collection and Preparation

Many data pipelines process data in discrete, periodic batches. I'm going to simulate that situation by downloading a dataset, splitting it into 1-month batches (based on a datetime-like column), and then writing each of those 1-month batches to file. I'll use the [Food Inspection dataset](https://data.cityofchicago.org/Health-Human-Services/Food-Inspections/4ijn-s7e5) on Chicago's public data portal.

In [4]:
#| code-fold: true

POST_DIR = Path(".").resolve()
POST_DATA_DIR = POST_DIR.joinpath("data")
POST_DATA_DIR.mkdir(exist_ok=True)

# First, we need to download the data to our local machine.
url = "https://data.cityofchicago.org/api/geospatial/4ijn-s7e5?method=export&format=GeoJSON"
full_file_path = POST_DATA_DIR.joinpath("full_food_inspections.geojson")
if not full_file_path.is_file():
    urlretrieve(url=url, filename=full_file_path)
food_inspection_gdf = gpd.read_file(full_file_path)

# For some reason, Socrata adds these four always-null location columns
#   to geospatial exports. I'm going to remove them.
location_cols = ["location_state", "location_zip", "location_address", "location_city"]
# uncomment the lines below to confirm those columns are always empty
# print("Rows with a non-null value in these location_xxx columns:")
# display(food_inspection_gdf[location_cols].notnull().sum())
food_inspection_gdf = food_inspection_gdf.drop(columns=location_cols)

# The column ordering is a bit chaotic, so I'll reorder the columns (for readability).
col_order = [
    "inspection_id", "inspection_date", "dba_name", "aka_name", "license_", "facility_type",
    "risk", "inspection_type", "results", "address", "city", "state", "zip", "violations",
    "longitude", "latitude", "geometry"
]
food_inspection_gdf = food_inspection_gdf[col_order].copy()

# I also want to break this into batches based on the dates, so I need to cast
#   the `inspection_date` to a datetime type.
food_inspection_gdf["inspection_date"] = pd.to_datetime(
    food_inspection_gdf["inspection_date"]
)

# I'll also cast string and numeric features to their proper dtypes.
food_inspection_gdf = food_inspection_gdf.convert_dtypes()
food_inspection_gdf["inspection_id"] = food_inspection_gdf["inspection_id"].astype("Int64")
food_inspection_gdf["license_"] = food_inspection_gdf["license_"].astype("Int64")
food_inspection_gdf["longitude"] = food_inspection_gdf["longitude"].astype("Float64")
food_inspection_gdf["latitude"] = food_inspection_gdf["latitude"].astype("Float64")

# I'll also make all strings uppercase (to reduce cardinality)
str_cols = list(food_inspection_gdf.head(2).select_dtypes(include="string").columns)
food_inspection_gdf[str_cols] = food_inspection_gdf[str_cols].apply(lambda x: x.str.upper())

Here's a sample of the dataset (after some light preprocessing and `dtype` setting).

In [25]:
print(food_inspection_gdf.shape)
food_inspection_gdf.head(2)

(255573, 17)


|  | inspection_id | inspection_date | dba_name | aka_name | license_ | facility_type | risk | inspection_type | results | address | city | state | zip | violations | longitude | latitude | geometry |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2577546 | 2023-06-20 | CIAO RAGAZZI | CIAO RAGAZZI | 2808408 | Restaurant | Risk 1 (High) | Canvass | Pass w/ Conditions | 5440 S NARRAGANSETT AVE | CHICAGO | IL | 60638 | 10. ADEQUATE HANDWASHING SINKS PROPERLY SUPPLI... | -87.781975 | 41.793006 | POINT (-87.78197 41.79301) |
| 1 | 2577553 | 2023-06-20 | CHISOX BAR & GRILL | CHI SOX BAR & GRILL | 2078887 | Restaurant | Risk 1 (High) | Canvass | Pass | 320 W 35TH ST | CHICAGO | IL | 60616 | 51. PLUMBING INSTALLED; PROPER BACKFLOW DEVICE... | -87.634932 | 41.831008 | POINT (-87.63493 41.83101) |


In the (folded up) cell below, we split the dataset into batches and write each batch to file in this post's `./data` directory.

In [43]:
#| code-fold: true

# I want to split the data into 1-month batches, so I need to get the first day of the month
#   for every month between the earliest inspection and the month after the latest inspection
#   in our food inspection dataset.
month_start_dates = pd.date_range(
    start=food_inspection_gdf["inspection_date"].min() + pd.DateOffset(months=-1),
    end=food_inspection_gdf["inspection_date"].max() + pd.DateOffset(months=1),
    freq="MS",
)

# Here, we'll iterate through each of those month_start_dates, extract the batch of data,
#   format a filename containing the month_start_date, and write the batch to file.
for month_start_date in month_start_dates:
    batch_period = pd.to_datetime(month_start_date).strftime("%Y_%m")
    batch_data = food_inspection_gdf.loc[
        food_inspection_gdf["inspection_date"].between(
            left=month_start_date,
            right=month_start_date + pd.DateOffset(months=1),
            inclusive="left")
    ].copy()
    batch_file_path = POST_DATA_DIR.joinpath(f"food_inspection_batch_{batch_period}.parquet")
    if not batch_file_path.is_file():
        batch_data.to_parquet(batch_file_path, index=False)
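
The batching cell above uses a half-open window (`inclusive="left"`), so an inspection dated on the first of a month belongs to that month's batch and not to the previous one. Here's a toy, standard-library-only sketch of the same assignment rule; the dates are made up, and grouping by month start stands in for the pandas `between` filter.

```python
import datetime as dt
from collections import defaultdict


def month_start(day: dt.date) -> dt.date:
    """Return the first day of the month containing `day`."""
    return day.replace(day=1)


# Hypothetical inspection dates, including two that sit on month boundaries.
inspection_dates = [
    dt.date(2023, 5, 31),
    dt.date(2023, 6, 1),
    dt.date(2023, 6, 20),
    dt.date(2023, 7, 1),
]

batches_by_period = defaultdict(list)
for day in inspection_dates:
    # Keying on the month start is equivalent to the half-open window
    # [month_start, next_month_start): each date lands in exactly one batch.
    batches_by_period[month_start(day).strftime("%Y_%m")].append(day)

for batch_period, days in sorted(batches_by_period.items()):
    print(f"food_inspection_batch_{batch_period}.parquet: {len(days)} rows")
```

With a closed window on both sides, the two boundary dates would each show up in two batches.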

## 1. Create or Load Great Expectations Data Context

In [13]:
import great_expectations as gx
from great_expectations.data_context import FileDataContext

context = FileDataContext.create(project_root_dir=POST_DIR)

## 2. Create a Datasource

In [17]:
datasource_name = "food_inspection_datasource"

if any(el["name"] == datasource_name for el in context.list_datasources()):
    print(f"Datasource with name '{datasource_name}' found; loading now")
    datasource = context.get_datasource(datasource_name)
else:
    print(f"No Datasource with name '{datasource_name}' found; creating now")
    datasource = context.sources.add_pandas_filesystem(
        name=datasource_name,
        base_directory=POST_DATA_DIR
    )

Datasource with name 'food_inspection_datasource' found; loading now


True

In [20]:
[el for el in dir(datasource) if el.startswith("add_")]

['add_csv_asset',
 'add_excel_asset',
 'add_feather_asset',
 'add_fwf_asset',
 'add_hdf_asset',
 'add_html_asset',
 'add_json_asset',
 'add_orc_asset',
 'add_parquet_asset',
 'add_pickle_asset',
 'add_sas_asset',
 'add_spss_asset',
 'add_stata_asset',
 'add_xml_asset']

In [29]:
data_asset_name = "food_inspections_asset"

if data_asset_name not in datasource.get_asset_names():
    print(f"Creating data asset {data_asset_name}")
    data_asset = datasource.add_parquet_asset(
        name=data_asset_name,
        batching_regex=r"food_inspection_batch_(?P<year>\d{4})_(?P<month>\d{2})\.parquet"
    )
else:
    data_asset = datasource.get_asset(data_asset_name)

Creating data asset food_inspections_asset
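
The named groups in the `batching_regex` are what give each batch its `year` and `month` labels. Here's a quick check of the pattern with Python's `re` module against a few file names. This only exercises the regex itself; how `great_expectations` applies it internally may differ, and using `fullmatch` here is my own choice.

```python
import re

# Same pattern as the `batching_regex` passed to `add_parquet_asset` above.
batching_regex = re.compile(
    r"food_inspection_batch_(?P<year>\d{4})_(?P<month>\d{2})\.parquet"
)

file_names = [
    "food_inspection_batch_2023_06.parquet",
    "food_inspection_batch_2010_01.parquet",
    "full_food_inspections.geojson",  # not a batch file, so it should not match
]

for file_name in file_names:
    match = batching_regex.fullmatch(file_name)
    print(file_name, "->", match.groupdict() if match else "no match")
```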


I'll also add sorters so that batches are ordered chronologically (ascending by year, then by month).

In [30]:
data_asset = data_asset.add_sorters(["+year", "+month"])
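
As I read it, the `+` prefix requests ascending order, and listing `year` before `month` makes month the tie-breaker. Here's a plain-Python sketch of that ordering on made-up batch metadata (the dicts mimic what the regex groups above would capture; this is not GX's sorter implementation).

```python
# Hypothetical batch metadata, as a regex like the one above would extract it.
batch_metadata = [
    {"year": "2023", "month": "06"},
    {"year": "2010", "month": "01"},
    {"year": "2023", "month": "01"},
    {"year": "2010", "month": "12"},
]

# "+year" then "+month": ascending on year, with month breaking ties.
# Zero-padded strings sort the same way the underlying numbers would.
sorted_metadata = sorted(batch_metadata, key=lambda m: (m["year"], m["month"]))
sorted_periods = [f'{m["year"]}_{m["month"]}' for m in sorted_metadata]
print(sorted_periods)
```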

In [31]:
batch_request = data_asset.build_batch_request()
batches = data_asset.get_batch_list_from_batch_request(batch_request)

## 3. Using the Onboarding Data Assistant to Create Basic Expectations

In [35]:
expectation_suite_name = "food_inspections_suite"

expectation_suite = context.add_or_update_expectation_suite(
    expectation_suite_name=expectation_suite_name
)

In [36]:
data_assistant_result = context.assistants.onboarding.run(
    batch_request=batch_request,
    exclude_column_names=[],
)
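
The Onboarding assistant profiles the batches in the batch request and proposes expectations from the metrics it computes. As a very rough illustration of that idea (not GX's actual estimation logic, which is more involved), the sketch below derives row-count bounds from a few made-up monthly row counts. `propose_row_count_expectation` is a hypothetical helper, while `expect_table_row_count_to_be_between` is the name of a real GX expectation.

```python
# Hypothetical row counts for a handful of monthly batches.
row_counts_by_batch = {
    "2023_01": 1510,
    "2023_02": 1432,
    "2023_03": 1688,
    "2023_04": 1475,
}


def propose_row_count_expectation(row_counts: dict[str, int]) -> dict:
    """Propose row-count bounds from the range seen in past batches."""
    counts = list(row_counts.values())
    return {
        "expectation_type": "expect_table_row_count_to_be_between",
        "kwargs": {"min_value": min(counts), "max_value": max(counts)},
    }


proposed = propose_row_count_expectation(row_counts_by_batch)
print(proposed)
```

Bounds learned this way are only as good as the history they come from, so it's worth reviewing the generated expectations before relying on them.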





In [47]:
data_assistant_result.plot_expectations_and_metrics(exclude_column_names=["inspection_date"])

62 Expectations produced, 9 Expectation and Metric plots implemented
Use DataAssistantResult.show_expectations_by_domain_type() or
DataAssistantResult.show_expectations_by_expectation_type() to show all produced Expectations





In [40]:
expectation_suite = data_assistant_result.get_expectation_suite(
    expectation_suite_name=expectation_suite_name
)

In [48]:
# expectation_suite.show_expectations_by_expectation_type()

In [50]:
saved_suite = context.add_or_update_expectation_suite(expectation_suite=expectation_suite)