## Setting up Feast

Feast supports the Great Expectations library for validating the quality of feature data. If you have not yet installed Feast with its Great Expectations extra, please install it now.

In [1]:
!pip install 'feast[ge]'

Collecting great-expectations<0.16.0,>=0.15.41 (from feast[ge])
  Obtaining dependency information for great-expectations<0.16.0,>=0.15.41 from https://files.pythonhosted.org/packages/1a/c7/d0038e9c14c207fc4e5103dbd3339c2e4f2865904d28211a172d474dcec6/great_expectations-0.15.50-py3-none-any.whl.metadata
  Downloading great_expectations-0.15.50-py3-none-any.whl.metadata (10 kB)
Collecting altair<4.2.1,>=4.0.0 (from great-expectations<0.16.0,>=0.15.41->feast[ge])
  Obtaining dependency information for altair<4.2.1,>=4.0.0 from https://files.pythonhosted.org/packages/0a/fb/56aaac0c69d106e380ff868cd5bb6cccacf2b8917a8527532bc89804a52e/altair-4.2.0-py3-none-any.whl.metadata
  Downloading altair-4.2.0-py3-none-any.whl.metadata (13 kB)
Collecting makefun<2,>=1.7.0 (from great-expectations<0.16.0,>=0.15.41->feast[ge])
  Obtaining dependency information for makefun<2,>=1.7.0 from https://files.pythonhosted.org/packages/47/45/51d50062d95a0c2fd8f5f1cc8849878ea5c76d2f6a049a0b9d449272e97f/makefun-1.1
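
Before moving on, it can help to confirm that both packages are visible to the notebook kernel. The following is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical and not part of Feast.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Distribution names as they appear in the pip output above.
for name in ("feast", "great-expectations"):
    version = installed_version(name)
    print(f"{name}: {version if version else 'not installed'}")
```

If either line reports `not installed`, rerun the `pip install` cell in the same environment the kernel uses.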

## Setting up the Feast repo and features

As part of this exercise, we are going to use the `trips_stats.parquet` and `entities.parquet` files as our data sources.

We are going to configure the Feast repo and define features.

In [2]:
import pyarrow.parquet
import pandas as pd

from feast import FeatureView, Entity, FeatureStore, Field, BatchFeatureView
from feast.types import Float64, Int64
from feast.value_type import ValueType
from feast.data_format import ParquetFormat
from feast.on_demand_feature_view import on_demand_feature_view
from feast.infra.offline_stores.file_source import FileSource
from feast.infra.offline_stores.file import SavedDatasetFileStorage
from datetime import timedelta



In [3]:
batch_source = FileSource(
    timestamp_field="day",
    path="trips_stats.parquet",  # Parquet file created in the previous step
    file_format=ParquetFormat()
)

In [4]:
taxi_entity = Entity(name='taxi', join_keys=['taxi_id'])

In [5]:
trips_stats_fv = BatchFeatureView(
    name='trip_stats',
    entities=[taxi_entity],
    schema=[
        Field(name="total_miles_travelled", dtype=Float64),
        Field(name="total_trip_seconds", dtype=Float64),
        Field(name="total_earned", dtype=Float64),
        Field(name="trip_count", dtype=Int64),

    ],
    ttl=timedelta(seconds=86400),
    source=batch_source,
)



In [6]:
@on_demand_feature_view(
    sources=[
      trips_stats_fv,
    ],
    schema=[
        Field(name="avg_fare", dtype=Float64),
        Field(name="avg_speed", dtype=Float64),
        Field(name="avg_trip_seconds", dtype=Float64),
        Field(name="earned_per_hour", dtype=Float64),
    ]
)
def on_demand_stats(inp: pd.DataFrame) -> pd.DataFrame:
    out = pd.DataFrame()
    out["avg_fare"] = inp["total_earned"] / inp["trip_count"]
    out["avg_speed"] = 3600 * inp["total_miles_travelled"] / inp["total_trip_seconds"]
    out["avg_trip_seconds"] = inp["total_trip_seconds"] / inp["trip_count"]
    out["earned_per_hour"] = 3600 * inp["total_earned"] / inp["total_trip_seconds"]
    return out
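
As a quick sanity check of the formulas above, here is a plain-Python sketch that applies the same arithmetic to a single row. The input values are copied from the first row of the saved dataset shown later in this notebook; the function name `derive_stats` is hypothetical and only mirrors `on_demand_stats` for one record.

```python
import math

# Aggregates for one taxi and one day (first row of the saved dataset).
row = {
    "total_earned": 203.50,
    "trip_count": 8,
    "total_miles_travelled": 69.50,
    "total_trip_seconds": 16080,
}


def derive_stats(r: dict) -> dict:
    """Plain-Python mirror of on_demand_stats for a single row."""
    return {
        # earnings per trip
        "avg_fare": r["total_earned"] / r["trip_count"],
        # miles per second scaled by 3600 gives miles per hour
        "avg_speed": 3600 * r["total_miles_travelled"] / r["total_trip_seconds"],
        # seconds per trip
        "avg_trip_seconds": r["total_trip_seconds"] / r["trip_count"],
        # earnings per second scaled by 3600 gives earnings per hour
        "earned_per_hour": 3600 * r["total_earned"] / r["total_trip_seconds"],
    }


stats = derive_stats(row)
for name, value in stats.items():
    print(f"{name}: {value:.6f}")
```

The printed values should agree with the `avg_fare`, `avg_speed`, `avg_trip_seconds` and `earned_per_hour` columns of that row.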

In [7]:
store = FeatureStore(".")  # uses the feature_store.yaml stored in the same directory

In [8]:
store.apply([taxi_entity, trips_stats_fv, on_demand_stats])  # writing to the registry



We have now finished setting up the Feast repo and features.

## Generating Training Dataset

We are going to build an entity dataframe from `entities.parquet` and use it to retrieve historical features from the Feast repo we created in the previous step.

In [9]:
taxi_ids = pyarrow.parquet.read_table("entities.parquet").to_pandas()

Generating a range of timestamps with daily frequency:

In [10]:
timestamps = pd.DataFrame()
timestamps["event_timestamp"] = pd.date_range("2019-06-01", "2019-07-01", freq='D')

A cross merge (also known as a Cartesian product) produces an entity dataframe with each `taxi_id` repeated for each timestamp:

In [11]:
entity_df = pd.merge(taxi_ids, timestamps, how='cross')
entity_df

Unnamed: 0,taxi_id,event_timestamp
0,91d5288487e87c5917b813ba6f75ab1c3a9749af906a2d...,2019-06-01
1,91d5288487e87c5917b813ba6f75ab1c3a9749af906a2d...,2019-06-02
2,91d5288487e87c5917b813ba6f75ab1c3a9749af906a2d...,2019-06-03
3,91d5288487e87c5917b813ba6f75ab1c3a9749af906a2d...,2019-06-04
4,91d5288487e87c5917b813ba6f75ab1c3a9749af906a2d...,2019-06-05
...,...,...
156979,7ebf27414a0c7b128e7925e1da56d51a8b81484f7630cf...,2019-06-27
156980,7ebf27414a0c7b128e7925e1da56d51a8b81484f7630cf...,2019-06-28
156981,7ebf27414a0c7b128e7925e1da56d51a8b81484f7630cf...,2019-06-29
156982,7ebf27414a0c7b128e7925e1da56d51a8b81484f7630cf...,2019-06-30
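
To make the cross merge concrete, here is a small standard-library sketch of the same idea. The three taxi ids are hypothetical placeholders (the real `taxi_id` values are long hash strings), and `itertools.product` stands in for `pd.merge(..., how='cross')`.

```python
from datetime import date, timedelta
from itertools import product

# Hypothetical short ids used only for illustration.
taxi_ids = ["taxi_a", "taxi_b", "taxi_c"]

# Daily timestamps from 2019-06-01 to 2019-07-01, both ends included,
# matching the pd.date_range(..., freq='D') call above.
start, end = date(2019, 6, 1), date(2019, 7, 1)
days = [start + timedelta(days=i) for i in range((end - start).days + 1)]

# Cartesian product: one (taxi_id, event_timestamp) row per combination.
entity_rows = list(product(taxi_ids, days))

print(len(days))         # 31
print(len(entity_rows))  # 93
print(entity_rows[0])    # ('taxi_a', datetime.date(2019, 6, 1))
```

The row count of the result is always the number of entities multiplied by the number of timestamps.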


Retrieving historical features for the resulting entity dataframe and persisting the output as a saved dataset:

In [12]:
job = store.get_historical_features(
    entity_df=entity_df,
    features=[
        "trip_stats:total_miles_travelled",
        "trip_stats:total_trip_seconds",
        "trip_stats:total_earned",
        "trip_stats:trip_count",
        "on_demand_stats:avg_fare",
        "on_demand_stats:avg_trip_seconds",
        "on_demand_stats:avg_speed",
        "on_demand_stats:earned_per_hour",
    ]
)

store.create_saved_dataset(
    from_=job,
    name='my_training_ds',
    storage=SavedDatasetFileStorage(path='my_training_ds.parquet')
)



<SavedDataset(name = my_training_ds, features = ['trip_stats:total_miles_travelled', 'trip_stats:total_trip_seconds', 'trip_stats:total_earned', 'trip_stats:trip_count', 'on_demand_stats:avg_fare', 'on_demand_stats:avg_trip_seconds', 'on_demand_stats:avg_speed', 'on_demand_stats:earned_per_hour'], join_keys = ['taxi_id'], storage = <feast.infra.offline_stores.file_source.SavedDatasetFileStorage object at 0x12a785b10>, full_feature_names = False, tags = {}, feature_service_name = None, _retrieval_job = <feast.infra.offline_stores.file.FileRetrievalJob object at 0x12a7808d0>, min_event_timestamp = 2019-06-01 00:00:00+00:00, max_event_timestamp = 2019-07-01 00:00:00+00:00, created_timestamp = 2024-06-18 18:51:39.099184, last_updated_timestamp = 2024-06-18 18:51:39.099184)>

## Developing a dataset profiler (a.k.a. setting up the data validation rules)

A dataset profiler is a function that accepts a dataset and generates a set of its characteristics. These characteristics are then used to evaluate (validate) subsequent datasets.

**Important: datasets are not compared to each other! Feast uses a reference dataset and a profiler function to generate a reference profile. This profile is then used to validate the tested dataset.**
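
The distinction between comparing datasets and comparing against a profile can be sketched in plain Python. This is only an illustration of the idea, not how Feast or Great Expectations implement it; `mean_window_profiler`, `validate` and the toy rows are all hypothetical.

```python
from statistics import mean
from typing import Dict, List, Tuple

Bounds = Dict[str, Tuple[float, float]]


def mean_window_profiler(rows: List[dict], delta: float = 0.1) -> Bounds:
    """Hypothetical profiler: allow each column mean to move by +/- delta."""
    profile = {}
    for col in rows[0].keys():
        m = mean(r[col] for r in rows)
        profile[col] = (m * (1 - delta), m * (1 + delta))
    return profile


def validate(rows: List[dict], profile: Bounds) -> Dict[str, bool]:
    """Check a tested dataset against the profile, never against the reference rows."""
    return {
        col: low <= mean(r[col] for r in rows) <= high
        for col, (low, high) in profile.items()
    }


reference = [{"trip_count": 10}, {"trip_count": 12}, {"trip_count": 14}]
similar = [{"trip_count": 11}, {"trip_count": 13}]
shifted = [{"trip_count": 6}, {"trip_count": 7}]

profile = mean_window_profiler(reference)  # built once from the reference data
print(validate(similar, profile))  # {'trip_count': True}
print(validate(shifted, profile))  # {'trip_count': False}
```

Once the profile exists, the reference rows are no longer needed: validation only reads the stored bounds.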

In [13]:
import numpy as np

from feast.dqm.profilers.ge_profiler import ge_profiler

from great_expectations.core.expectation_suite import ExpectationSuite
from great_expectations.dataset import PandasDataset



Loading the saved dataset first and exploring the data:



In [14]:
ds = store.get_saved_dataset('my_training_ds')
ds.to_df()

Unnamed: 0,earned_per_hour,total_trip_seconds,total_miles_travelled,avg_speed,event_timestamp,taxi_id,avg_trip_seconds,avg_fare,trip_count,total_earned
0,45.559701,16080,69.50,15.559701,2019-06-01 00:00:00+00:00,d13c5aaa066f94b4927779ed24cd313b0c686f03407095...,2010.000000,25.437500,8,203.50
1,36.219512,7380,15.80,7.707317,2019-06-01 00:00:00+00:00,33164e16dd29b1c58cd15cce31df4bfcb75d9903cb66de...,1476.000000,14.850000,5,74.25
2,54.212598,7620,38.50,18.188976,2019-06-01 00:00:00+00:00,226fe0b00be42932bdff81bc0b318b883bfbf15dd48093...,1270.000000,19.125000,6,114.75
3,45.000000,5660,20.22,12.860777,2019-06-01 00:00:00+00:00,5a5bed1b5ced617d0594007d591f10bbbca354d50b19ca...,1415.000000,17.687500,4,70.75
4,53.783319,6978,34.49,17.793637,2019-06-01 00:00:00+00:00,b7f7dbb452c0fb980a0f2050a146147c1006fe5f34e3b0...,1395.600000,20.850000,5,104.25
...,...,...,...,...,...,...,...,...,...,...
119803,76.369295,4820,36.98,27.619917,2019-07-01 00:00:00+00:00,961263722c1beadafef2355412d672acac35e4054f6aaa...,1205.000000,25.562500,4,102.25
119804,52.677165,7620,29.00,13.700787,2019-07-01 00:00:00+00:00,8b07f9156e568a37d362463c84dbd1118b4eeb753bae50...,692.727273,10.136364,11,111.50
119805,54.649682,9420,31.00,11.847134,2019-07-01 00:00:00+00:00,a112879f10892d5c698ce150af17aa28615b6d005ca749...,588.750000,8.937500,16,143.00
119806,73.770492,4941,37.86,27.584699,2019-07-01 00:00:00+00:00,68fe14b9fc2d53de5ac349d47f80f43fea895e201a31e3...,1647.000000,33.750000,3,101.25


Feast uses [Great Expectations](https://docs.greatexpectations.io/docs/home/) as its validation engine and an [ExpectationSuite](https://docs.greatexpectations.io/docs/oss/guides/validation/validate_data_lp/) as the dataset's profile. Hence, we need to develop a function that generates an ExpectationSuite. This function receives an instance of PandasDataset (a wrapper around pandas.DataFrame), so we can use both the Pandas DataFrame API and the helper functions of PandasDataset during profiling.

In [15]:
DELTA = 0.1  # allowed window as a fraction of the observed value, on a scale of [0, 1]

@ge_profiler
def stats_profiler(ds: PandasDataset) -> ExpectationSuite:
    # simple checks on data consistency
    ds.expect_column_values_to_be_between(
        "avg_speed",
        min_value=0,
        max_value=60,
        mostly=0.99  # allow some outliers
    )

    ds.expect_column_values_to_be_between(
        "total_miles_travelled",
        min_value=0,
        max_value=500,
        mostly=0.99  # allow some outliers
    )

    # expectation of means based on observed values
    observed_mean = ds.trip_count.mean()
    ds.expect_column_mean_to_be_between("trip_count",
                                        min_value=observed_mean * (1 - DELTA),
                                        max_value=observed_mean * (1 + DELTA))

    observed_mean = ds.earned_per_hour.mean()
    ds.expect_column_mean_to_be_between("earned_per_hour",
                                        min_value=observed_mean * (1 - DELTA),
                                        max_value=observed_mean * (1 + DELTA))


    # expectation of quantiles
    qs = [0.5, 0.75, 0.9, 0.95]
    observed_quantiles = ds.avg_fare.quantile(qs)

    ds.expect_column_quantile_values_to_be_between(
        "avg_fare",
        quantile_ranges={
            "quantiles": qs,
            "value_ranges": [[None, max_value] for max_value in observed_quantiles]
        })

    return ds.get_expectation_suite()
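
The `mostly=0.99` argument used above means that the expectation passes when at least 99% of the values fall inside the range. A minimal plain-Python sketch of that rule follows; `passes_between` is a hypothetical helper, it assumes a non-empty input, and it is not the Great Expectations implementation.

```python
def passes_between(values, min_value, max_value, mostly=1.0):
    """Return True when at least `mostly` of the non-null values are in range.

    Assumes `values` contains at least one non-null entry.
    """
    present = [v for v in values if v is not None]
    in_range = sum(min_value <= v <= max_value for v in present)
    return in_range / len(present) >= mostly


# 200 hypothetical average speeds: 199 plausible values and one outlier.
one_outlier = [30.0] * 199 + [95.0]
# Same size, but with three outliers.
three_outliers = [30.0] * 197 + [95.0] * 3

print(passes_between(one_outlier, 0, 60, mostly=0.99))     # True  (99.5% in range)
print(passes_between(three_outliers, 0, 60, mostly=0.99))  # False (98.5% in range)
print(passes_between(one_outlier, 0, 60))                  # False (strict check)
```

Leaving room for a small share of outliers keeps the rule from failing on a handful of bad records while still catching a systematic shift.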

Testing our profiler function:

In [16]:
ds.get_profile(profiler=stats_profiler)

<GEProfile with expectations: [
  {
    "expectation_type": "expect_column_values_to_be_between",
    "kwargs": {
      "column": "avg_speed",
      "min_value": 0,
      "max_value": 60,
      "mostly": 0.99
    },
    "meta": {}
  },
  {
    "expectation_type": "expect_column_values_to_be_between",
    "kwargs": {
      "column": "total_miles_travelled",
      "min_value": 0,
      "max_value": 500,
      "mostly": 0.99
    },
    "meta": {}
  },
  {
    "expectation_type": "expect_column_mean_to_be_between",
    "kwargs": {
      "column": "trip_count",
      "min_value": 10.387244591346153,
      "max_value": 12.695521167200855
    },
    "meta": {}
  },
  {
    "expectation_type": "expect_column_mean_to_be_between",
    "kwargs": {
      "column": "earned_per_hour",
      "min_value": 52.32062497564023,
      "max_value": 63.9474305257825
    },
    "meta": {}
  },
  {
    "expectation_type": "expect_column_quantile_values_to_be_between",
    "kwargs": {
      "column": "avg_fare"

**Verify that all the expectations coded in the profiler are present here. If an expectation is missing, it failed on the reference dataset; Great Expectations drops such expectations silently by default.**

## Validating the feature data against the validation rules: successful scenario

Now we can create a validation reference from the dataset and the profiler function:
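
This check can also be done programmatically. The sketch below assumes the expectations of the profile are available as a list of dictionaries shaped like the printed output above; how to obtain that list from the profile object is not covered here, and `missing_expectations` is a hypothetical helper.

```python
# (expectation type, column) pairs coded in stats_profiler.
declared = {
    ("expect_column_values_to_be_between", "avg_speed"),
    ("expect_column_values_to_be_between", "total_miles_travelled"),
    ("expect_column_mean_to_be_between", "trip_count"),
    ("expect_column_mean_to_be_between", "earned_per_hour"),
    ("expect_column_quantile_values_to_be_between", "avg_fare"),
}


def missing_expectations(declared_pairs, profile_expectations):
    """Return declared (type, column) pairs that are absent from the profile."""
    present = {
        (e["expectation_type"], e["kwargs"]["column"])
        for e in profile_expectations
    }
    return declared_pairs - present


# Hypothetical profile content in which every declared expectation survived.
full_profile = [
    {"expectation_type": t, "kwargs": {"column": c}} for t, c in declared
]
# Hypothetical profile in which the quantile expectation was dropped.
partial_profile = [
    e for e in full_profile
    if e["expectation_type"] != "expect_column_quantile_values_to_be_between"
]

print(missing_expectations(declared, full_profile))     # set()
print(missing_expectations(declared, partial_profile))
```

A non-empty result points at the expectation that did not hold on the reference dataset.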

In [17]:
validation_reference = ds.as_reference(name="validation_reference_dataset", profiler=stats_profiler)

and test it against our existing retrieval job:

In [18]:
from feast.dqm.errors import ValidationFailed  # needed by the except clause below

try:
    _ = job.to_df(validation_reference=validation_reference)
    print("Data passed all the validation rules")
except ValidationFailed as exc:
    print("Data Failed some or all the validation rules")
    print(exc.validation_report)



Data passed all the validation rules


Validation passed, as no exception was raised. The feature data adheres to the data validation rules.

## Demonstrating the Validation Failure Use Case

In this section we retrieve historical feature data that violates the data characteristics defined in the previous step, which raises a `ValidationFailed` exception.

In [19]:
from feast.dqm.errors import ValidationFailed

In [20]:
timestamps = pd.DataFrame()
timestamps["event_timestamp"] = pd.date_range("2020-12-01", "2020-12-07", freq='D')

In [21]:
entity_df = pd.merge(taxi_ids, timestamps, how='cross')
entity_df

Unnamed: 0,taxi_id,event_timestamp
0,91d5288487e87c5917b813ba6f75ab1c3a9749af906a2d...,2020-12-01
1,91d5288487e87c5917b813ba6f75ab1c3a9749af906a2d...,2020-12-02
2,91d5288487e87c5917b813ba6f75ab1c3a9749af906a2d...,2020-12-03
3,91d5288487e87c5917b813ba6f75ab1c3a9749af906a2d...,2020-12-04
4,91d5288487e87c5917b813ba6f75ab1c3a9749af906a2d...,2020-12-05
...,...,...
35443,7ebf27414a0c7b128e7925e1da56d51a8b81484f7630cf...,2020-12-03
35444,7ebf27414a0c7b128e7925e1da56d51a8b81484f7630cf...,2020-12-04
35445,7ebf27414a0c7b128e7925e1da56d51a8b81484f7630cf...,2020-12-05
35446,7ebf27414a0c7b128e7925e1da56d51a8b81484f7630cf...,2020-12-06


In [22]:
job = store.get_historical_features(
    entity_df=entity_df,
    features=[
        "trip_stats:total_miles_travelled",
        "trip_stats:total_trip_seconds",
        "trip_stats:total_earned",
        "trip_stats:trip_count",
        "on_demand_stats:avg_fare",
        "on_demand_stats:avg_trip_seconds",
        "on_demand_stats:avg_speed",
        "on_demand_stats:earned_per_hour",
    ]
)

In [25]:
try:
    df = job.to_df(validation_reference=validation_reference)
    print("Data passed all the validation rules")
except ValidationFailed as exc:
    print("Data Failed some or all the validation rules, exception report:")
    print("------------------------------------------------")
    print(exc.validation_report)

Data Failed some or all the validation rules, exception report:
------------------------------------------------
[
  {
    "success": false,
    "expectation_config": {
      "expectation_type": "expect_column_mean_to_be_between",
      "kwargs": {
        "column": "trip_count",
        "min_value": 10.387244591346153,
        "max_value": 12.695521167200855,
        "result_format": "COMPLETE"
      },
      "meta": {}
    },
    "result": {
      "observed_value": 6.692920555429092,
      "element_count": 4393,
      "missing_count": null,
      "missing_percent": null
    },
    "meta": {},
    "exception_info": {
      "raised_exception": false,
      "exception_message": null,
      "exception_traceback": null
    }
  },
  {
    "success": false,
    "expectation_config": {
      "expectation_type": "expect_column_mean_to_be_between",
      "kwargs": {
        "column": "earned_per_hour",
        "min_value": 52.32062497564023,
        "max_value": 63.9474305257825,
        "resu

Validation failed since several expectations didn't pass:
 - Trip count (mean) decreased by more than 10% (which is expected when comparing Dec 2020 vs June 2019)
 - Average fare increased: all quantiles are higher than expected
 - Earned per hour (mean) increased by more than 10% (most probably due to the increased fare)
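
The size of the `trip_count` shift can be worked out from the numbers in the report. The following sketch only uses values copied from the report above and relies on the fact that the profiler sets the bounds to `mean * (1 - DELTA)` and `mean * (1 + DELTA)`, so their midpoint is the reference mean.

```python
DELTA = 0.1

# Bounds and observed value copied from the validation report above.
min_value = 10.387244591346153
max_value = 12.695521167200855
observed = 6.692920555429092

# Midpoint of the symmetric window recovers the reference mean.
reference_mean = (min_value + max_value) / 2
relative_change = (observed - reference_mean) / reference_mean

print(f"reference mean:  {reference_mean:.3f}")   # 11.541
print(f"relative change: {relative_change:.1%}")  # -42.0%
print(min_value <= observed <= max_value)         # False
```

A drop of roughly 42% is far outside the 10% window, which is why this expectation fails.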