# Data contracts

Data contracts are how PyRetailScience ensures that the data entering a function is as expected, for instance that it is at
a customer/product granularity. They are based on Great Expectations and are easy to extend to reflect your own data and
your expectations of how it should look.

Let's start off by generating some simulated data.

In [1]:
from pyretailscience.data.simulation import Simulation
import pandas as pd

config_file = "../data/default_data_config.yaml"

sim = Simulation.from_config_file(seed=42, config_file=config_file)
sim.run()

transactions_df = pd.DataFrame(sim.transactions)
transactions_df.head()

Simulating days: 100%|██████████| 729/729 [02:07<00:00,  5.70it/s]


| | transaction_id | transaction_datetime | customer_id | product_id | product_name | category_0_name | category_0_id | category_1_name | category_1_id | brand_name | brand_id | unit_price | quantity | total_price | store_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 12106 | 2022-01-14 09:50:01 | 1 | 1682 | Streamer LX | Music | 10 | Bass Guitars | 67 | Warwick | 337 | 1299.99 | 0 | 0.0 | 2 |
| 1 | 6846 | 2022-02-14 09:33:01 | 1 | 651 | Old Skool | Clothing | 4 | Shoes | 26 | Vans | 131 | 60.0 | 2 | 120.0 | 6 |
| 2 | 6846 | 2022-02-14 09:33:01 | 1 | 1742 | Conservatoire Violin | Music | 10 | String Instruments | 70 | Stentor | 349 | 329.99 | 0 | 0.0 | 6 |
| 3 | 43779 | 2022-03-11 16:02:40 | 1 | 350 | Crunchy Butterfly Shrimp | Grocery | 2 | Meat & Seafood | 14 | Gorton's | 70 | 7.49 | 1 | 7.49 | 7 |
| 4 | 43779 | 2022-03-11 16:02:40 | 1 | 580 | Athletic Taper Jeans with GapFlex | Clothing | 4 | Jeans | 23 | Gap | 116 | 65.0 | 6 | 390.0 | 7 |


There are several built-in contracts.

**TransactionItemLevelContract**

This contract is for data that is at a transaction/product granularity and is typical when you need to perform any kind of detailed analysis.

**TransactionLevelContract**

This contract is similar to the `TransactionItemLevelContract` but lacks any information about a product in a transaction.

**CustomerLevelContract**

This contract is for datasets that contain one row of data per customer.

Here is a simple example of how to validate that a dataset meets the requirements of a data contract. In this case we'll check that the data we just simulated meets the requirements of the `TransactionItemLevelContract`.

In [2]:
from pyretailscience.data import contracts

tc = contracts.TransactionItemLevelContract(transactions_df)
tc.validate(contracts.EExpectationSet.BASIC)

True

As expected, the data meets the expectations of the contract. Since data contracts are powered by the Great Expectations package, we can take a deeper look at the result of the validation.

In [3]:
tc.validation_result

{
  "success": true,
  "results": [
    {
      "success": true,
      "expectation_config": {
        "expectation_type": "expect_column_to_exist",
        "kwargs": {
          "column": "transaction_id"
        },
        "meta": {}
      },
      "result": {},
      "meta": {},
      "exception_info": {
        "raised_exception": false,
        "exception_message": null,
        "exception_traceback": null
      }
    },
    {
      "success": true,
      "expectation_config": {
        "expectation_type": "expect_column_to_exist",
        "kwargs": {
          "column": "transaction_datetime"
        },
        "meta": {}
      },
      "result": {},
      "meta": {},
      "exception_info": {
        "raised_exception": false,
        "exception_message": null,
        "exception_traceback": null
      }
    },
    {
      "success": true,
      "expectation_config": {
        "expectation_type": "expect_column_to_exist",
        "kwargs": {
          "column": "customer_id"
   

The data contract has two sets of expectations. **Basic Expectations** are a set of expectations that can be run quickly, such as checking whether a column exists in a DataFrame. You can see these by returning the `basic_expectations` property of the contract.

In [4]:
tc.basic_expectations

[{"expectation_type": "expect_column_to_exist", "kwargs": {"column": "transaction_id"}, "meta": {}},
 {"expectation_type": "expect_column_to_exist", "kwargs": {"column": "transaction_datetime"}, "meta": {}},
 {"expectation_type": "expect_column_to_exist", "kwargs": {"column": "customer_id"}, "meta": {}},
 {"expectation_type": "expect_column_to_exist", "kwargs": {"column": "total_price"}, "meta": {}},
 {"expectation_type": "expect_column_to_exist", "kwargs": {"column": "store_id"}, "meta": {}},
 {"expectation_type": "expect_column_to_exist", "kwargs": {"column": "product_id"}, "meta": {}},
 {"expectation_type": "expect_column_to_exist", "kwargs": {"column": "product_name"}, "meta": {}},
 {"expectation_type": "expect_column_to_exist", "kwargs": {"column": "unit_price"}, "meta": {}},
 {"expectation_type": "expect_column_to_exist", "kwargs": {"column": "quantity"}, "meta": {}}]

**Extended Expectations** are those that might be too slow to run on a regular basis, for instance checking whether a set of columns is unique in combination. For large datasets you may not want to run these often. You can see these by returning the `extended_expectations` property of the contract.

In [5]:
tc.extended_expectations

[{"expectation_type": "expect_compound_columns_to_be_unique", "kwargs": {"column_list": ["transaction_id", "transaction_datetime", "customer_id", "store_id"]}, "meta": {}},
 {"expectation_type": "expect_transaction_product_quantity_sign_to_be_unique", "kwargs": {}, "meta": {}},
 {"expectation_type": "expect_column_values_to_not_be_null", "kwargs": {"column": "transaction_id"}, "meta": {}},
 {"expectation_type": "expect_column_values_to_not_be_null", "kwargs": {"column": "transaction_datetime"}, "meta": {}},
 {"expectation_type": "expect_column_values_to_not_be_null", "kwargs": {"column": "customer_id"}, "meta": {}},
 {"expectation_type": "expect_column_values_to_not_be_null", "kwargs": {"column": "total_price"}, "meta": {}},
 {"expectation_type": "expect_column_values_to_not_be_null", "kwargs": {"column": "store_id"}, "meta": {}},
 {"expectation_type": "expect_column_values_to_not_be_null", "kwargs": {"column": "product_id"}, "meta": {}},
 {"expectation_type": "expect_column_values_to_

Together these serve as a basic form of data documentation that helps get others up to speed on your data and your expectations of it.
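To make the cost difference between the two sets concrete, here is a minimal plain pandas sketch. It does not use the contract classes. The small DataFrame and the choice of key columns are made up for illustration. The first check only inspects column names, while the second has to scan every row.

```python
import pandas as pd

# Illustrative data only, not taken from the simulation above.
df = pd.DataFrame(
    {
        "transaction_id": [1, 1, 2],
        "product_id": [10, 11, 10],
        "quantity": [2, 1, 3],
    }
)

# Basic-style check: only looks at the column index, so it is cheap.
required = ["transaction_id", "product_id", "quantity"]
missing = [col for col in required if col not in df.columns]

# Extended-style check: compares rows across several columns, so the
# cost grows with the size of the dataset.
key = ["transaction_id", "product_id"]
has_duplicates = bool(df.duplicated(subset=key).any())

print(f"missing columns: {missing}")
print(f"duplicate keys: {has_duplicates}")
```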

Now let's see what happens when your data doesn't meet the expectations of a contract. The **TransactionItemLevelContract** assumes that a column `transaction_datetime` is present in the data and it represents the date and time of a transaction. As such, it should always be populated.

In [6]:
transactions_df["transaction_datetime"] = None
tc = contracts.TransactionItemLevelContract(transactions_df)
tc.validate(contracts.EExpectationSet.EXTENDED)

False

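You can see that the data now fails the contract. If you want more detail on exactly what failed and why, set `verbose=True` when validating. In this case the contract fails for two data-related reasons: the requirement mentioned above that every transaction has a datetime, and the combination of several columns no longer being unique. The verbose output also lists `expect_transaction_product_quantity_sign_to_be_unique`, which raised an `AttributeError` rather than reporting a problem with the data.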

In [7]:
tc.validate(contracts.EExpectationSet.EXTENDED, verbose=True)

expect_column_values_to_not_be_null
{'column': 'transaction_datetime'}
expect_compound_columns_to_be_unique
{'column_list': ['transaction_id', 'transaction_datetime', 'customer_id', 'store_id']}
expect_transaction_product_quantity_sign_to_be_unique
{}
'PyRetailSciencePandasDataset' object has no attribute 'expect_transaction_product_quantity_sign_to_be_unique'
Traceback (most recent call last):
  File "/home/devbox/ds_repos/pyretailscience/.venv/lib/python3.10/site-packages/great_expectations/data_asset/data_asset.py", line 804, in validate
    expectation_method = getattr(self, expectation.expectation_type)
  File "/home/devbox/ds_repos/pyretailscience/.venv/lib/python3.10/site-packages/pandas/core/generic.py", line 6204, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'PyRetailSciencePandasDataset' object has no attribute 'expect_transaction_product_quantity_sign_to_be_unique'



False
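Verbose mode prints the failing expectations. If you would rather collect them in code, one option is to walk the structure shown earlier for `tc.validation_result`. The sketch below assumes the result is available as a plain dictionary (Great Expectations result objects offer a `to_json_dict()` method for this, but treat that call as an assumption here). The helper name `failed_expectations` is hypothetical, and the example dictionary is a hand-written stand-in rather than real output.

```python
from typing import Any


def failed_expectations(validation_result: dict[str, Any]) -> list[str]:
    """Return the expectation types whose result was not successful."""
    return [
        result["expectation_config"]["expectation_type"]
        for result in validation_result.get("results", [])
        if not result["success"]
    ]


# Hand-written stand-in that mirrors the shape of the JSON shown earlier.
example_result = {
    "success": False,
    "results": [
        {
            "success": True,
            "expectation_config": {
                "expectation_type": "expect_column_to_exist",
                "kwargs": {"column": "transaction_datetime"},
            },
        },
        {
            "success": False,
            "expectation_config": {
                "expectation_type": "expect_column_values_to_not_be_null",
                "kwargs": {"column": "transaction_datetime"},
            },
        },
    ],
}

print(failed_expectations(example_result))
```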

When we transform the data, we may have a new set of expectations for it. As a result, we may want to revalidate the data with a new contract. For instance, let's say we want to perform some customer level analysis. The `CustomerLevelContract` serves as the basis for this and ensures that only one row per customer exists.

In [8]:
customer_df = transactions_df.groupby("customer_id")["total_price"].sum().reset_index()
customer_df.head()

| | customer_id | total_price |
|---|---|---|
| 0 | 1 | 65020.36 |
| 1 | 2 | 93233.22 |
| 2 | 3 | 46035.98 |
| 3 | 4 | 58687.99 |
| 4 | 5 | 34679.12 |


In [9]:
cc = contracts.CustomerLevelContract(customer_df)
cc.validate(contracts.EExpectationSet.BASIC)

True

You can see that the customer level data contract is much simpler.

In [10]:
cc.basic_expectations + cc.extended_expectations

[{"expectation_type": "expect_column_to_exist", "kwargs": {"column": "customer_id"}, "meta": {}},
 {"expectation_type": "expect_column_values_to_be_unique", "kwargs": {"column": "customer_id"}, "meta": {}}]
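For intuition, the two expectations above correspond roughly to two plain pandas checks. This is only a sketch on made-up data, not how the contract is implemented: it aggregates a toy transaction table to customer level and then confirms that the column exists and that each customer appears once.

```python
import pandas as pd

# Illustrative item-level data with repeated customers.
transactions = pd.DataFrame(
    {
        "customer_id": [1, 1, 2, 3, 3],
        "total_price": [10.0, 5.0, 7.5, 1.0, 2.0],
    }
)

# Aggregate to one row per customer, as in the groupby above.
customers = transactions.groupby("customer_id")["total_price"].sum().reset_index()

# Rough equivalents of the two customer level expectations.
has_column = "customer_id" in customers.columns
one_row_per_customer = customers["customer_id"].is_unique

print(has_column, one_row_per_customer)
```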

Now let's see how to use contracts to ensure that the data meets the expectations of your analysis.

We'll do some analysis that needs to be done at the customer level, so let's validate our data with the extended `CustomerLevelContract`.

This contract requires that the customer ID is unique, which is what we need for our simple top customer report.

In [14]:
def top_customers(df, n=5) -> pd.DataFrame:
    """Returns the top n customers by total price spent."""
    cc = contracts.CustomerLevelContract(df)
    if not cc.validate(contracts.EExpectationSet.EXTENDED):
        raise Exception(f"Customer level contract failed validation. {cc.validation_result}")

    return df.sort_values("total_price", ascending=False).head(n).reset_index(drop=True)

display(top_customers(customer_df, n=3))

| | customer_id | total_price |
|---|---|---|
| 0 | 1546 | 311418.68 |
| 1 | 316 | 297522.84 |
| 2 | 1680 | 296068.84 |


Now let's violate the assumptions of the contract by adding duplicate customers to the data. When this happens, either the data has been corrupted or we need to run a groupby to aggregate it to the granularity our function requires.

In [12]:
bad_customer_df = pd.concat([customer_df, customer_df])
display(top_customers(bad_customer_df, n=3))

False


Exception: Customer level contract failed validation. {
  "success": false,
  "results": [
    {
      "success": true,
      "expectation_config": {
        "expectation_type": "expect_column_to_exist",
        "kwargs": {
          "column": "customer_id"
        },
        "meta": {}
      },
      "result": {},
      "meta": {},
      "exception_info": {
        "raised_exception": false,
        "exception_message": null,
        "exception_traceback": null
      }
    },
    {
      "success": false,
      "expectation_config": {
        "expectation_type": "expect_column_values_to_be_unique",
        "kwargs": {
          "column": "customer_id"
        },
        "meta": {}
      },
      "result": {
        "element_count": 4716,
        "missing_count": 0,
        "missing_percent": 0.0,
        "unexpected_count": 4716,
        "unexpected_percent": 100.0,
        "unexpected_percent_total": 100.0,
        "unexpected_percent_nonmissing": 100.0,
        "partial_unexpected_list": [
          1,
          2,
          3,
          4,
          5,
          6,
          7,
          8,
          9,
          10,
          11,
          12,
          13,
          14,
          15,
          16,
          17,
          18,
          19,
          20
        ]
      },
      "meta": {},
      "exception_info": {
        "raised_exception": false,
        "exception_message": null,
        "exception_traceback": null
      }
    }
  ],
  "evaluation_parameters": {},
  "statistics": {
    "evaluated_expectations": 2,
    "successful_expectations": 1,
    "unsuccessful_expectations": 1,
    "success_percent": 50.0
  },
  "meta": {
    "great_expectations_version": "0.18.8",
    "expectation_suite_name": "expectations",
    "run_id": {
      "run_name": null,
      "run_time": "2024-02-15T16:24:56.352062+01:00"
    },
    "batch_kwargs": {
      "ge_batch_id": "5fca82b2-cc16-11ee-9a9a-00155d8f1107"
    },
    "batch_markers": {},
    "batch_parameters": {},
    "validation_time": "20240215T152456.351998Z",
    "expectation_suite_meta": {
      "great_expectations_version": "0.18.8"
    }
  }
}
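Once the contract has flagged duplicate customers, the right repair depends on what the duplicates are. The sketch below uses a small made-up DataFrame and shows both cases mentioned above: if rows were repeated by mistake, dropping exact duplicates restores the original data, whereas if each row is a partial total, a groupby brings the data back to one row per customer. Note that summing rows that are really repeats would double count spend.

```python
import pandas as pd

# Illustrative customer level data, duplicated as in the cell above.
customers = pd.DataFrame(
    {"customer_id": [1, 2, 3], "total_price": [15.0, 7.5, 3.0]}
)
duplicated = pd.concat([customers, customers])

# Which customer IDs appear more than once?
is_repeated = duplicated["customer_id"].duplicated(keep=False)
repeated_ids = sorted(duplicated.loc[is_repeated, "customer_id"].unique().tolist())

# Case 1: rows were repeated by mistake, so drop exact duplicates.
deduped = duplicated.drop_duplicates().reset_index(drop=True)

# Case 2: each row is a partial total, so aggregate to customer level.
reaggregated = duplicated.groupby("customer_id", as_index=False)["total_price"].sum()

print(repeated_ids)
print(deduped)
print(reaggregated)
```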

### Custom Contracts

The out-of-the-box contracts are a great starting point, but you can extend them to create contracts based on your own data and your expectations of it. For instance, let's assume that no customer will ever have more than $10,000 or less than $0 of spend. You can create a contract with those expectations.

Let's start with the existing CustomerLevelContract and add on our assumptions.


In [15]:
from great_expectations.core.expectation_configuration import ExpectationConfiguration


class CustomCustomerLevelContract(contracts.CustomerLevelContract):
    """A custom customer level contract that adds a new expectation to the basic and extended sets. This contract
    ensures that the total_price column exists and is within a certain range."""

    MIN_CUSTOMER_TOTAL_PRICE = 0
    MAX_CUSTOMER_TOTAL_PRICE = 10_000

    def __init__(self, df: pd.DataFrame):
        # Make sure the total_price column is in the DataFrame
        self.basic_expectations.extend(
            [
                ExpectationConfiguration(
                    expectation_type="expect_column_to_exist",
                    kwargs={"column": "total_price"},
                ),
            ]
        )

        self.extended_expectations.extend(
            [
                # Make sure the total_price column does not contain null values
                ExpectationConfiguration(
                    expectation_type="expect_column_values_to_not_be_null",
                    kwargs={"column": "total_price"},
                ),
                # And that it is within a certain range
                ExpectationConfiguration(
                    expectation_type="expect_column_values_to_be_between",
                    kwargs={
                        "column": "total_price",
                        "min_value": self.MIN_CUSTOMER_TOTAL_PRICE,
                        "max_value": self.MAX_CUSTOMER_TOTAL_PRICE,
                    },
                ),
            ]
        )

        super().__init__(df)

Now, with this new contract created, let's test our data to see if it meets expectations.

In [16]:
custom_contract = CustomCustomerLevelContract(customer_df)
custom_contract.validate(contracts.EExpectationSet.EXTENDED, verbose=True)

expect_column_values_to_be_between
{'column': 'total_price', 'min_value': 0, 'max_value': 10000}


False

In [17]:
customer_df["total_price"].agg(["min","max"])

min     13764.80
max    311418.68
Name: total_price, dtype: float64

We can see that every customer in the DataFrame has a `total_price` above our required range: the minimum is 13,764.80, while the maximum allowed value is 10,000. Let's clip these values and see if the contract passes this time.

In [18]:
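# Clip only total_price so that customer_id values are left untouched
customer_df["total_price"] = customer_df["total_price"].clip(lower=0, upper=10_000)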
custom_contract = CustomCustomerLevelContract(customer_df)
custom_contract.validate(contracts.EExpectationSet.EXTENDED, verbose=True)

True

Nice! With the values clipped the contract validates.
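Clipping makes the contract pass by changing the data, so it can be worth checking how many values were altered. The following is a small sketch on made-up spend figures, using only standard pandas calls, that clips a series to the contract's range and counts the affected rows.

```python
import pandas as pd

# Illustrative spend values, some outside the 0 to 10,000 range.
spend = pd.Series([-5.0, 2_500.0, 13_764.80, 311_418.68], name="total_price")

# Values below 0 become 0 and values above 10,000 become 10,000.
clipped = spend.clip(lower=0, upper=10_000)

# Count how many values the clip actually changed.
n_changed = int((clipped != spend).sum())

# Confirm everything is now within the inclusive range.
in_range = bool(clipped.between(0, 10_000).all())

print(clipped.tolist(), n_changed, in_range)
```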