# Overview

In this tutorial, we'll use Feast to generate training data and power online model inference for a 
ride-sharing driver satisfaction prediction model. Feast solves several common issues in this flow:

1. **Training-serving skew and complex data joins:** Feature values often exist across multiple tables. Joining 
   these datasets can be complicated, slow, and error-prone.
   * Feast joins these tables with battle-tested logic that ensures _point-in-time_ correctness so future feature 
     values do not leak to models.
2. **Online feature availability:** At inference time, models often need access to features that aren't readily 
   available and need to be precomputed from other data sources.
   * Feast manages deployment to a variety of online stores (e.g. DynamoDB, Redis, Google Cloud Datastore) and 
     ensures necessary features are consistently _available_ and _freshly computed_ at inference time.
3. **Feature and model versioning:** Different teams within an organization are often unable to reuse 
   features across projects, resulting in duplicate feature creation logic. Models have data dependencies that need 
   to be versioned, for example when running A/B tests on model versions.
   * Feast enables discovery of and collaboration on previously used features and enables versioning of sets of 
     features (via _feature services_).
   * _(Experimental)_ Feast enables light-weight feature transformations so users can re-use transformation logic 
     across online / offline use cases and across models.

We will:
1. Deploy a local feature store with a **Parquet file offline store** and **Sqlite online store**.
2. Build a training dataset using our time series features from our **Parquet files**.
3. Materialize feature values from the offline store into the online store.
4. Read the latest features from the online store for inference.

## Step 1: Install Feast

Install Feast (and psycopg2-binary) using pip:


In [None]:
# Step 1: Install Feast 
!python3 -m pip install feast==0.29.0
!python3 -m pip install typeguard==2.13.3
!python3 -m pip install psycopg2-binary

## Step 2: Inspect a feature repository

A feature repository is a directory that contains the configuration of the feature store and individual features. This configuration is written as code (Python/YAML) and it's highly recommended that teams track it centrally using git. See [Feature Repository](https://docs.feast.dev/reference/feature-repository) for a detailed explanation of feature repositories.

The easiest way to create a new feature repository to use the `feast init` command. This creates a scaffolding with initial demo data.

### Demo data scenario 
- We have surveyed some drivers for how satisfied they are with their experience in a ride-sharing app. 
- We want to generate predictions for driver satisfaction for the rest of the users so we can reach out to potentially dissatisfied users.

### Step 2a: Inspecting the feature repository

Let's take a look at the demo repo itself. It breaks down into


* `data/` contains raw demo parquet data
* `definition.py` contains demo feature definitions
* `feature_store.yaml` contains a demo setup configuring where data sources are



In [10]:
!cd /mnt/shared/feast-store
!ls -R

.:
data  definitions.py  online-store.ipynb  __pycache__

./data:
driver_stats.parquet

./__pycache__:
definitions.cpython-38.pyc


### Step 2b: Inspecting the project configuration
Let's inspect the setup of the project in `feature_store.yaml`. 

The key line defining the overall architecture of the feature store is the **provider**. 

The provider value sets default offline and online stores. 
* The offline store provides the compute layer to process historical data (for generating training data & feature 
  values for serving). 
* The online store is a low latency store of the latest feature values (for powering real-time inference).

Valid values for `provider` in `feature_store.yaml` are:

* local: use file source with SQLite/Redis
* gcp: use BigQuery/Snowflake with Google Cloud Datastore/Redis
* aws: use Redshift/Snowflake with DynamoDB/Redis

Note that there are many other offline / online stores Feast works with, including Azure, Hive, Trino, and PostgreSQL via community plugins. See https://docs.feast.dev/roadmap for all supported connectors.

A custom setup can also be made by following [Customizing Feast](https://docs.feast.dev/v/master/how-to-guides/customizing-feast)

In [6]:
!cat /mnt/shared/feast-store/feature_store.yaml

project: ezaf_feast_demo_ride_sharing
registry: /mnt/shared/feast-store/data/registry.db
provider: local
offline_store:
  type: file
online_store:
  type: sqlite
entity_key_serialization_version: 2


### Inspecting the raw data

The raw feature data we have in this demo is stored in a local parquet file. The dataset captures hourly stats of a driver in a ride-sharing app.

In [1]:
import pandas as pd
from pathlib import Path

pd.read_parquet(f"{str(Path.home())}/user/Feast/data/driver_stats.parquet", engine="pyarrow")

Unnamed: 0,event_timestamp,driver_id,conv_rate,acc_rate,avg_daily_trips,created
0,2023-02-15 08:00:00+00:00,1005,0.718795,0.758693,679,2023-03-02 08:21:05.729
1,2023-02-15 09:00:00+00:00,1005,0.315331,0.682747,313,2023-03-02 08:21:05.729
2,2023-02-15 10:00:00+00:00,1005,0.288372,0.934601,783,2023-03-02 08:21:05.729
3,2023-02-15 11:00:00+00:00,1005,0.279908,0.104038,445,2023-03-02 08:21:05.729
4,2023-02-15 12:00:00+00:00,1005,0.868386,0.416725,430,2023-03-02 08:21:05.729
...,...,...,...,...,...,...
1802,2023-03-02 06:00:00+00:00,1001,0.294989,0.557884,194,2023-03-02 08:21:05.729
1803,2023-03-02 07:00:00+00:00,1001,0.009235,0.841543,132,2023-03-02 08:21:05.729
1804,2021-04-12 07:00:00+00:00,1001,0.323702,0.801096,624,2023-03-02 08:21:05.729
1805,2023-02-22 20:00:00+00:00,1003,0.593804,0.182048,417,2023-03-02 08:21:05.729


## Step 3: Register feature definitions and deploy your feature store

`feast apply` scans python files in the current directory for feature/entity definitions and deploys infrastructure according to `feature_store.yaml`.



### Step 3a: Inspecting feature definitions
Now we run `feast apply` to register the feature views and entities defined in `definition.py`, and sets up SQLite online store tables. Note that we had previously specified SQLite as the online store in `feature_store.yaml` by specifying a `local` provider.

In [2]:
### Step 3a: Inspecting feature definitions
!cat definitions.py

# This is an example feature definition file

from datetime import timedelta

import pandas as pd

from feast import (
    Entity,
    FeatureService,
    FeatureView,
    Field,
    FileSource,
    PushSource,
    RequestSource,
)
from feast.on_demand_feature_view import on_demand_feature_view
from feast.types import Float32, Float64, Int64

# Define an entity for the driver. You can think of an entity as a primary key used to
# fetch features.
driver = Entity(name="driver", join_keys=["driver_id"])

# Read data from parquet files. Parquet is convenient for local development mode. For
# production, you can use your favorite DWH, such as BigQuery. See Feast documentation
# for more info.
driver_stats_source = FileSource(
    name="driver_hourly_stats_source",
    path="/home/hpedemouser01/feast/data/driver_stats.parquet",
    timestamp_field="event_timestamp",
    created_timestamp_column="created",
)

# Our parquet files contain sample data that includes a driver_id column, timestamps

### Step 3b: Applying feature definitions
Now we run `feast apply` to register the feature views and entities defined in `definitions.py`, and sets up SQLite online store tables. Note that we had previously specified SQLite as the online store in `feature_store.yaml` by specifying a `local` provider.

In [1]:
from definitions import driver, driver_stats_source, driver_stats_fv, driver_stats_push_source, input_request, driver_activity_v1, driver_activity_v2, driver_activity_v3, driver_stats_fresh_fv
from definitions import transformed_conv_rate
from pprint import pprint
from feast import FeatureStore, Entity, FeatureView, Feature, ValueType, FileSource, RepoConfig

from datetime import datetime, timedelta
import pandas as pd

Store = "/mnt/shared/feast-store"
fs = FeatureStore(repo_path=Store)

fs.apply([driver, driver_stats_source, driver_stats_fv, driver_stats_push_source, input_request, driver_activity_v1, driver_activity_v2, transformed_conv_rate, driver_activity_v3, driver_stats_fresh_fv])

  _numeric_index_types = (pd.Int64Index, pd.Float64Index, pd.UInt64Index)
  _numeric_index_types = (pd.Int64Index, pd.Float64Index, pd.UInt64Index)
  _numeric_index_types = (pd.Int64Index, pd.Float64Index, pd.UInt64Index)
  schema = ParquetDataset(path).schema.to_arrow_schema()


## Step 4: Generating training data or powering batch scoring models

To train a model, we need features and labels. Often, this label data is stored separately (e.g. you have one table storing user survey results and another set of tables with feature values). Feast can help generate the features that map to these labels.

Feast needs a list of **entities** (e.g. driver ids) and **timestamps**. Feast will intelligently join relevant 
tables to create the relevant feature vectors. There are two ways to generate this list:
1. The user can query that table of labels with timestamps and pass that into Feast as an _entity dataframe_ for 
training data generation. 
2. The user can also query that table with a *SQL query* which pulls entities. See the documentation on [feature retrieval](https://docs.feast.dev/getting-started/concepts/feature-retrieval) for details    

* Note that we include timestamps because we want the features for the same driver at various timestamps to be used in a model.

### Step 4a: Generating training data

In [5]:
from datetime import datetime
import pandas as pd

from feast import FeatureStore

# The entity dataframe is the dataframe we want to enrich with feature values
# Note: see https://docs.feast.dev/getting-started/concepts/feature-retrieval for more details on how to retrieve
# for all entities in the offline store instead
entity_df = pd.DataFrame.from_dict(
    {
        # entity's join key -> entity values
        "driver_id": [1001, 1002, 1003],
        # "event_timestamp" (reserved key) -> timestamps
        "event_timestamp": [
            datetime(2021, 4, 12, 10, 59, 42),
            datetime(2021, 4, 12, 8, 12, 10),
            datetime(2021, 4, 12, 16, 40, 26),
        ],
        # (optional) label name -> label values. Feast does not process these
        "label_driver_reported_satisfaction": [1, 5, 3],
        # values we're using for an on-demand transformation
        "val_to_add": [1, 2, 3],
        "val_to_add_2": [10, 20, 30],
    }
)

training_df = fs.get_historical_features(
    entity_df=entity_df,
    features=[
        "driver_hourly_stats:conv_rate",
        "driver_hourly_stats:acc_rate",
        "driver_hourly_stats:avg_daily_trips",
        "transformed_conv_rate:conv_rate_plus_val1",
        "transformed_conv_rate:conv_rate_plus_val2",
    ],
).to_df()

print("----- Feature schema -----\n")
print(training_df.info())

print()
print("----- Example features -----\n")
print(training_df.head())

----- Feature schema -----

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 10 columns):
 #   Column                              Non-Null Count  Dtype              
---  ------                              --------------  -----              
 0   driver_id                           3 non-null      int64              
 1   event_timestamp                     3 non-null      datetime64[ns, UTC]
 2   label_driver_reported_satisfaction  3 non-null      int64              
 3   val_to_add                          3 non-null      int64              
 4   val_to_add_2                        3 non-null      int64              
 5   conv_rate                           3 non-null      float32            
 6   acc_rate                            3 non-null      float32            
 7   avg_daily_trips                     3 non-null      int32              
 8   conv_rate_plus_val1                 3 non-null      float64            
 9   conv_rate_plus_val2

  if dataset.partitions is not None:
  if dataset.metadata:
  if dataset.schema is not None:
  if len(dataset.pieces) > 1:
  for piece, fn in zip(dataset.pieces, fns):
Users of the modes 'nearest', 'lower', 'higher', or 'midpoint' are encouraged to review the method they. (Deprecated NumPy 1.22)
  result = np.percentile(a2, q, interpolation=interpolation).astype(a.dtype)
Users of the modes 'nearest', 'lower', 'higher', or 'midpoint' are encouraged to review the method they. (Deprecated NumPy 1.22)
  result = np.percentile(a2, q, interpolation=interpolation).astype(a.dtype)


### Step 4b: Run offline inference (batch scoring)
To power a batch model, we primarily need to generate features with the `get_historical_features` call, but using the current timestamp

In [6]:
# entity_df["event_timestamp"] = pd.to_datetime("now", utc=True)
training_df = fs.get_historical_features(
    entity_df=entity_df,
    features=[
        "driver_hourly_stats:conv_rate",
        "driver_hourly_stats:acc_rate",
        "driver_hourly_stats:avg_daily_trips",
        "transformed_conv_rate:conv_rate_plus_val1",
        "transformed_conv_rate:conv_rate_plus_val2",
    ],
).to_df()

print("\n----- Example features -----\n")
print(training_df.head())


----- Example features -----

   driver_id           event_timestamp  label_driver_reported_satisfaction  \
0       1001 2021-04-12 10:59:42+00:00                                   1   
1       1002 2021-04-12 08:12:10+00:00                                   5   
2       1003 2021-04-12 16:40:26+00:00                                   3   

   val_to_add  val_to_add_2  conv_rate  acc_rate  avg_daily_trips  \
0           1            10   0.323702  0.801096              624   
1           2            20   0.963921  0.071397              767   
2           3            30   0.712921  0.167478              590   

   conv_rate_plus_val1  conv_rate_plus_val2  
0             1.323702            10.323702  
1             2.963921            20.963921  
2             3.712921            30.712921  


  if dataset.partitions is not None:
  if dataset.metadata:
  if dataset.schema is not None:
  if len(dataset.pieces) > 1:
  for piece, fn in zip(dataset.pieces, fns):
Users of the modes 'nearest', 'lower', 'higher', or 'midpoint' are encouraged to review the method they. (Deprecated NumPy 1.22)
  result = np.percentile(a2, q, interpolation=interpolation).astype(a.dtype)
Users of the modes 'nearest', 'lower', 'higher', or 'midpoint' are encouraged to review the method they. (Deprecated NumPy 1.22)
  result = np.percentile(a2, q, interpolation=interpolation).astype(a.dtype)


## Step 5: Load features into your online store

### Step 5a: Using `materialize_incremental`

We now serialize the latest values of features since the beginning of time to prepare for serving (note: `materialize_incremental` serializes all new features since the last `materialize` call).

An alternative to using the CLI command is to use Python:

```bash
CURRENT_TIME=$(date -u +"%Y-%m-%dT%H:%M:%S")
feast materialize-incremental $CURRENT_TIME
```

In [7]:
from datetime import datetime
fs.materialize_incremental(datetime.now())

  if dataset.partitions is not None:
  if dataset.metadata:
  if dataset.schema is not None:
  if len(dataset.pieces) > 1:
  for piece, fn in zip(dataset.pieces, fns):
Users of the modes 'nearest', 'lower', 'higher', or 'midpoint' are encouraged to review the method they. (Deprecated NumPy 1.22)
  result = np.percentile(a2, q, interpolation=interpolation).astype(a.dtype)
Users of the modes 'nearest', 'lower', 'higher', or 'midpoint' are encouraged to review the method they. (Deprecated NumPy 1.22)
  result = np.percentile(a2, q, interpolation=interpolation).astype(a.dtype)


Materializing [1m[32m1[0m feature views to [1m[32m2023-03-25 06:27:02+00:00[0m into the [1m[32msqlite[0m online store.

[1m[32mdriver_hourly_stats[0m from [1m[32m2023-03-24 06:27:02+00:00[0m to [1m[32m2023-03-25 06:27:02+00:00[0m:


0it [00:00, ?it/s]


### Step 5b: Inspect materialized features

Note that now there are `online_store.db` and `registry.db`, which store the materialized features and schema information, respectively.

In [9]:
print("--- Data directory ---")
!ls /mnt/shared/feast-store/data

import sqlite3
import pandas as pd
con = sqlite3.connect("/mnt/shared/feast-store/data/online.db")
print("\n--- Schema of online store ---")
print(
    pd.read_sql_query(
        "SELECT * FROM ezaf_feast_repo_driver_hourly_stats", con).columns.tolist())
con.close()

--- Data directory ---
online.db  registry.db

--- Schema of online store ---
['entity_key', 'feature_name', 'value', 'event_ts', 'created_ts']


## Step 6: Fetching real-time feature vectors for online inference

At inference time, we need to quickly read the latest feature values for different drivers (which otherwise might have existed only in batch sources) from the online feature store using `get_online_features()`. These feature vectors can then be fed to the model.

In [None]:
from pprint import pprint
from feast import FeatureStore

feature_vector = fs.get_online_features(
    features=[
        "driver_hourly_stats:acc_rate",
        "driver_hourly_stats:avg_daily_trips",
        "transformed_conv_rate:conv_rate_plus_val1",
        "transformed_conv_rate:conv_rate_plus_val2",
    ],
    entity_rows=[
        # {join_key: entity_value}
        {
            "driver_id": 1001,
            "val_to_add": 1000,
            "val_to_add_2": 2000,
        },
        {
            "driver_id": 1002,
            "val_to_add": 1001,
            "val_to_add_2": 2002,
        },
    ],
).to_dict()

pprint(feature_vector)

### Fetching features using feature services
You can also use feature services to manage multiple features, and decouple feature view definitions and the features needed by end applications. The feature store can also be used to fetch either online or historical features using the same api below. More information can be found [here](https://docs.feast.dev/getting-started/concepts/feature-retrieval).

 The `driver_activity_v1` feature service pulls all features from the `driver_hourly_stats` feature view:

```python
driver_stats_fs = FeatureService(
    name="driver_activity_v1", features=[driver_hourly_stats_view]
)
```

In [None]:
from feast import FeatureStore

feature_service = fs.get_feature_service("driver_activity_v1")
feature_vector = fs.get_online_features(
    features=feature_service,
    entity_rows=[
        # {join_key: entity_value}
        {
            "driver_id": 1001,
            "val_to_add": 1000,
            "val_to_add_2": 2000,
        },
        {
            "driver_id": 1002,
            "val_to_add": 1001,
            "val_to_add_2": 2002,
        },
    ],
).to_dict()
pprint(feature_vector)

## Step 7: Making streaming features available in Feast
Feast does not directly ingest from streaming sources. Instead, Feast relies on a push-based model to push features into Feast. You can write a streaming pipeline that generates features, which can then be pushed to the offline store, the online store, or both (depending on your needs).

This relies on the `PushSource` defined above. Pushing to this source will populate all dependent feature views with the pushed feature values.

In [None]:
from feast.data_source import PushMode

print("\n--- Simulate a stream event ingestion of the hourly stats df ---")
event_df = pd.DataFrame.from_dict(
    {
        "driver_id": [1001],
        "event_timestamp": [
            datetime(2021, 5, 13, 10, 59, 42),
        ],
        "created": [
            datetime(2021, 5, 13, 10, 59, 42),
        ],
        "conv_rate": [1.0],
        "acc_rate": [1.0],
        "avg_daily_trips": [1000],
    }
)
print(event_df)
fs.push("driver_stats_push_source", event_df, to=PushMode.ONLINE_AND_OFFLINE)