# Ride-Sharing

In this tutorial, you explore how to use Feast to generate, reuse, and share training features, and to serve those same
features consistently and efficiently for near real-time model inference. The goal in this use case is to train a model
that predicts ride-sharing driver satisfaction. Feast addresses several common challenges in this pipeline:

1. **Training-serving skew and complex data joins**: Often, feature values are scattered across multiple tables. Merging
   these datasets can be a tedious, time-consuming, and error-prone task. Feast simplifies this process with proven
   logic that maintains point-in-time accuracy, preventing future feature values from leaking into your models.
1. **Online feature availability**: During inference, models frequently require features that aren't immediately
   accessible and must be derived from other data sources. Feast manages deployment to a range of online stores,
   such as DynamoDB, Redis, and Google Cloud Datastore, so that up-to-date feature values can be read with low latency
   at inference time.
1. **Feature and model versioning**: Within larger organizations, different teams often duplicate feature creation
   logic because they can't reuse features from other projects. In addition, models have data dependencies that need
   versioning, for example when running A/B tests on several model iterations. Feast promotes the discovery and
   sharing of previously used features and supports versioning groups of features through feature services.
1. **Feature transformations**: Feast introduces lightweight feature transformations, so users can standardize
   transformation logic for both online and offline scenarios and across different models.

## Table of Contents

- [Feature Repository Structure](#feature-repository-structure)
- [Inspecting the Raw Data](#inspecting-the-raw-data)
- [Register Feature Definitions](#register-feature-definitions)
- [Generate a Training Dataset](#generate-training-data)
- [Generate Features for Batch Scoring](#generate-features-for-batch-scoring)
- [Ingest Batch Features into an Online Store](#ingest-batch-features-into-an-online-store)
- [Fetch Online Features for Real-time Inference](#fetch-online-features-for-real-time-inference)
- [Fetch Online Features Using a Feature Service](#fetch-online-features-using-a-feature-service)
- [Ingest Streaming Features](#ingest-streaming-features)
- [Train the Model](#train-the-model)

To get started, let's import the libraries you need and explore the Feast feature repository.

In [None]:
import os

from pathlib import Path
from datetime import datetime

import pandas as pd

from feast import FeatureStore
from feast.data_source import PushMode
from sklearn.linear_model import LinearRegression

from definitions import *

# Feature Repository Structure

The cell below lists the files and directories inside the Feast feature repository. The key entries are:

- `data/`: This directory holds the raw demo data in Parquet format.
- `feature_store.yaml`: This YAML file holds the Feast configuration and the location of the data sources.

In [None]:
feature_repo_path = Path("/mnt/shared/feast-store")
os.listdir(feature_repo_path)

Let's take a closer look at the `feature_store.yaml` configuration file:

In [None]:
with open(feature_repo_path/"feature_store.yaml", "r") as file:
    for line in file:
        print(line, end='')

The `feature_store.yaml` file defines the overall architecture of the feature store in Feast. The `provider` value
determines the default offline and online stores to be used.

- **Offline Store**: Serves as the compute layer to process historical data. This is crucial for the generation of
  training data and calculating feature values for serving.
- **Online Store**: Acts as a low-latency repository for the most recent feature values, facilitating real-time
  inference.

Feast is compatible with many offline and online stores, including Spark, Azure, Hive, Trino, and PostgreSQL.
Integration with these platforms is provided by community plugins. In this case, you use a local file as the offline
store and SQLite as the online store.

Let's now take a closer look at the dataset.

# Inspecting the Raw Data

In this demo, the raw feature data is stored in a local Parquet file. This dataset chronicles the hourly statistics
related to a driver's activity on a ride-sharing platform.

In [None]:
pd.read_parquet("data/driver_stats.parquet").head()

In the rest of this tutorial, you go through a complete, typical Feast workflow. Here are the primary steps:

1. [Register feature definitions](#register-feature-definitions)
1. [Generate a Training Dataset](#generate-training-data)
1. [Generate Features for Batch Scoring](#generate-features-for-batch-scoring)
1. [Ingest Batch Features into an Online Store](#ingest-batch-features-into-an-online-store)
1. [Fetch Online Features for Real-time Inference](#fetch-online-features-for-real-time-inference)
1. [Fetch Online Features Using a Feature Service](#fetch-online-features-using-a-feature-service)
1. [Ingest Streaming Features](#ingest-streaming-features)
1. [Train the Model](#train-the-model)

# Register Feature Definitions

In the next cell, you create a new `FeatureStore` object by pointing it to the location of the Feast feature repository.
Then, the `apply` method inspects the objects you pass to it to identify feature view and entity definitions.

Once found, it registers these objects and deploys the necessary infrastructure. In this case, the call processes the
variables imported from the `definitions.py` file inside the demo's directory and then creates SQLite tables for the
online store. Note that SQLite was selected as the default online store through the `online_store` parameter in
`feature_store.yaml`.

In [None]:
fs = FeatureStore(repo_path=feature_repo_path)

fs.apply([driver, driver_stats_source, driver_stats_feature_view, driver_stats_push_source,
          driver_activity, transformed_stats, driver_stats_fresh_feature_view])

Now that you have registered the feature views and entities in your store, you can view the result in the Feast UI, by
navigating to it from your EzUA dashboard:

![feast-ui](images/feast-ui.png)

# Generate Training Data

To train a model, both features and labels are essential. Frequently, labels are stored separately from the features.
For instance, while one table might store user survey results, another set of tables may contain the feature values.
Feast simplifies the process of mapping these features to the corresponding labels.

To achieve this, Feast requires a list of entities (such as driver IDs) accompanied by timestamps. With this
information, Feast intelligently joins the pertinent tables to produce the appropriate feature vectors. There are two
primary methods to generate this list:

1. Users can query the labels table, which contains timestamps, and then feed this data into Feast as an entity
   dataframe. This approach is useful for generating training data.
1. Alternatively, users can utilize a SQL query to extract entities from the table.

> It's crucial to incorporate timestamps since the objective is to leverage features corresponding to the same driver
> at different moments for model training.

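The point-in-time join described above can be illustrated with a minimal, standard-library sketch. This is not how
Feast implements the join, and it ignores details such as feature view TTLs; the `history` data and the
`point_in_time_lookup` helper are hypothetical names introduced only for this example. For each entity and timestamp,
the lookup returns the latest feature row recorded at or before that timestamp, so later values never leak in.

```python
from bisect import bisect_right
from datetime import datetime

# Hypothetical feature history per driver, sorted by event time (illustrative values).
history = {
    1001: [
        (datetime(2021, 4, 12, 9, 0), {"conv_rate": 0.40}),
        (datetime(2021, 4, 12, 10, 0), {"conv_rate": 0.55}),
        (datetime(2021, 4, 12, 11, 0), {"conv_rate": 0.90}),
    ],
}


def point_in_time_lookup(driver_id, as_of):
    """Return the latest feature row at or before `as_of`, or None if there is none."""
    rows = history.get(driver_id, [])
    timestamps = [ts for ts, _ in rows]
    # Index of the first row strictly after `as_of`; the row before it is the match.
    idx = bisect_right(timestamps, as_of)
    return rows[idx - 1][1] if idx > 0 else None


# The label was recorded at 10:59:42, so the 11:00 row must not be used.
label_time = datetime(2021, 4, 12, 10, 59, 42)
features = point_in_time_lookup(1001, label_time)
print(features)
```
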
In [None]:
entity_df = pd.DataFrame.from_dict({
        # entity's join key -> entity values
        "driver_id": [1001, 1002, 1003],
        # "event_timestamp" (reserved key) -> timestamps
        "event_timestamp": [
            datetime(2021, 4, 12, 10, 59, 42),
            datetime(2021, 4, 12, 8, 12, 10),
            datetime(2021, 4, 12, 16, 40, 26),
        ],
        # (optional) label name -> label values. Feast does not process these.
        "label_driver_reported_satisfaction": [1, 5, 3],
    }
)

# on demand transformations give us the last two features
# `conv_plus_trips` and `acc_plus_trips`
training_df = fs.get_historical_features(
    entity_df=entity_df,
    features=[
        "driver_hourly_stats:conv_rate",
        "driver_hourly_stats:acc_rate",
        "driver_hourly_stats:avg_daily_trips",
        "transformed_stats:conv_plus_trips",
        "transformed_stats:acc_plus_trips",
    ],
).to_df()

print("----- Feature schema -----\n")
print(training_df.info())

print()
print("----- Example features -----\n")
print(training_df.head())

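The last two features requested above come from the `transformed_stats` view, whose definition lives in
`definitions.py` and is not shown in this notebook. As a rough sketch only, and assuming (from the feature names) that
each one adds a rate to `avg_daily_trips`, an on-demand transformation can be thought of as a pure function of the
other feature values. The function name and the input values below are hypothetical.

```python
def transformed_stats_sketch(row):
    """Hypothetical stand-in for an on-demand transformation (assumed formulas)."""
    return {
        "conv_plus_trips": row["conv_rate"] + row["avg_daily_trips"],
        "acc_plus_trips": row["acc_rate"] + row["avg_daily_trips"],
    }


# Illustrative input row, not taken from the demo dataset.
row = {"conv_rate": 0.5, "acc_rate": 0.25, "avg_daily_trips": 100}
derived = transformed_stats_sketch(row)
print(derived)
```

Because the function depends only on its inputs, the same logic can be applied to historical rows and to rows fetched
at inference time, which is what keeps the derived values consistent between training and serving.
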
# Generate Features for Batch Scoring

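To power a batch model, you mainly need to generate features, again with the `get_historical_features` call. For batch
scoring, the entity timestamps should be the current time so that Feast returns the latest available feature values.
Note that the cell below leaves that assignment commented out, so as written it reuses the timestamps already in
`entity_df`; uncomment the first line to score against the current time.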

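As a small illustration of what "using the current timestamp" means, the sketch below stamps every entity with one
shared, timezone-aware "now". It uses plain dictionaries instead of a pandas dataframe, and the `entity_rows` and
`scoring_time` names are introduced only for this example.

```python
from datetime import datetime, timezone

driver_ids = [1001, 1002, 1003]

# Take the timestamp once so that every entity is scored as of the same moment.
scoring_time = datetime.now(timezone.utc)

entity_rows = [
    {"driver_id": driver_id, "event_timestamp": scoring_time}
    for driver_id in driver_ids
]
print(entity_rows)
```
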
In [None]:
# entity_df["event_timestamp"] = pd.to_datetime("now", utc=True)
training_df = fs.get_historical_features(
    entity_df=entity_df,
    features=[
        "driver_hourly_stats:conv_rate",
        "driver_hourly_stats:acc_rate",
        "driver_hourly_stats:avg_daily_trips",
        "transformed_stats:conv_plus_trips",
        "transformed_stats:acc_plus_trips",
    ],
).to_df()

print("\n----- Example features -----\n")
print(training_df.head())

# Ingest Batch Features into an Online Store

To prepare for serving, you materialize the feature values: Feast loads the most recent value of each feature within
the given time range from the offline store into the online store.

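Conceptually, this step keeps only the newest row per entity inside a time window. The sketch below shows that idea
with plain Python data; it is not Feast's implementation, and `offline_rows` and `materialize_sketch` are hypothetical
names with made-up values.

```python
from datetime import datetime

# Hypothetical offline rows: (driver_id, event_timestamp, conv_rate)
offline_rows = [
    (1001, datetime(2021, 4, 12, 9, 0), 0.40),
    (1001, datetime(2021, 4, 12, 11, 0), 0.90),
    (1002, datetime(2021, 4, 12, 10, 0), 0.70),
    (1002, datetime(2021, 4, 11, 10, 0), 0.10),  # outside the window used below
]


def materialize_sketch(rows, start, end):
    """Return the latest row per driver with start <= event_timestamp <= end."""
    online = {}
    for driver_id, ts, conv_rate in rows:
        if not (start <= ts <= end):
            continue
        current = online.get(driver_id)
        if current is None or ts > current["event_timestamp"]:
            online[driver_id] = {"event_timestamp": ts, "conv_rate": conv_rate}
    return online


online_store = materialize_sketch(
    offline_rows, start=datetime(2021, 4, 12), end=datetime(2021, 4, 13)
)
print(online_store)
```
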
In [None]:
fs.materialize(
    start_date=datetime(2021, 4, 12, 10, 59, 42),
    end_date=datetime.now(),
    feature_views=["driver_hourly_stats"],
)

# Fetch Online Features for Real-time Inference

During inference, it's essential to rapidly access the latest feature values for various drivers. Without an online
feature store, these values might solely reside in batch sources. By utilizing the `get_online_features()` function, you
can swiftly retrieve these values from the online feature store. Once obtained, these feature vectors are ready to be
fed into the model.

In [None]:
feature_vector = fs.get_online_features(
    features=[
        "driver_hourly_stats:conv_rate",
        "driver_hourly_stats:acc_rate",
        "driver_hourly_stats:avg_daily_trips",
    ],
    entity_rows=[
        # {join_key: entity_value}
        {"driver_id": 1004},
        {"driver_id": 1005},
    ],
).to_dict()

print(feature_vector)

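The dictionary printed above is keyed by feature name, with one list entry per requested entity. Before such values
can be fed into a model, they usually need to be turned into rows with a fixed column order. The sketch below shows
one way to do that, assuming a dictionary of that shape; the values and the `to_model_rows` helper are hypothetical.
The set of columns must match whatever the model was trained on.

```python
# Hypothetical values in the shape of the dictionary printed above.
example_vector = {
    "driver_id": [1004, 1005],
    "conv_rate": [0.3, 0.8],
    "acc_rate": [0.6, 0.1],
    "avg_daily_trips": [120, 45],
}


def to_model_rows(vector):
    """Turn a column-oriented dict into rows, with columns in sorted order."""
    columns = sorted(vector)
    rows = [list(values) for values in zip(*(vector[name] for name in columns))]
    return columns, rows


columns, rows = to_model_rows(example_vector)
print(columns)
print(rows)
```

Sorting the column names gives a deterministic order that does not depend on how the dictionary was built.
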
# Fetch Online Features Using a Feature Service

Feature services offer a way to manage multiple features, allowing for a decoupling between feature view definitions and
the features required by end applications. Additionally, the feature store provides the flexibility to fetch either
online or historical features using the API outlined below.

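One way to picture this decoupling is as a named list of feature references that the application asks for by name.
The sketch below is a simplified illustration rather than Feast's data model. The service name and its contents are
hypothetical, because the real `driver_activity` definition lives in `definitions.py` and is not shown here.

```python
# Features offered by each feature view (names taken from the cells above).
FEATURE_VIEWS = {
    "driver_hourly_stats": ["conv_rate", "acc_rate", "avg_daily_trips"],
}

# Hypothetical service: a named subset of features drawn from one or more views.
FEATURE_SERVICES = {
    "driver_activity_sketch": {"driver_hourly_stats": ["conv_rate", "acc_rate"]},
}


def resolve_feature_refs(service_name):
    """Expand a service name into 'view:feature' references."""
    refs = []
    for view, names in FEATURE_SERVICES[service_name].items():
        for name in names:
            if name not in FEATURE_VIEWS[view]:
                raise KeyError(f"{view} has no feature named {name}")
            refs.append(f"{view}:{name}")
    return refs


refs = resolve_feature_refs("driver_activity_sketch")
print(refs)
```

The application only needs the service name; which views and features sit behind it can change without touching the
calling code.
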
In [None]:
feature_service = fs.get_feature_service("driver_activity")

feature_vector = fs.get_online_features(
    features=feature_service,
    entity_rows=[
        {"driver_id": 1004},
        {"driver_id": 1005},
    ],
).to_dict()

print(feature_vector)

# Ingest Streaming Features

Feast doesn't natively support ingestion from streaming sources. Instead, it operates on a push-based model where
features are actively pushed into Feast. You can create a streaming pipeline dedicated to feature generation, and
subsequently push these features to either the offline store, the online store, or both, based on your requirements.

This approach is contingent on the `PushSource` defined earlier. Pushing data to this source ensures that all connected
feature views are populated with the newly pushed feature values.

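The push model can be pictured as a single entry point that fans each incoming row out to the offline store, the
online store, or both. The class below is a toy stand-in written for this explanation, not Feast's implementation;
`PushSourceSketch` and its attributes are hypothetical, and the online side simply lets the last write win.

```python
from datetime import datetime


class PushSourceSketch:
    """Toy push target that keeps an offline history and an online snapshot."""

    def __init__(self):
        self.offline_rows = []  # append-only history, as used for training data
        self.online_rows = {}   # one row per driver, as used for serving

    def push(self, row, to_online=True, to_offline=True):
        if to_offline:
            self.offline_rows.append(row)
        if to_online:
            # Last write wins in this sketch.
            self.online_rows[row["driver_id"]] = row


source = PushSourceSketch()
source.push({"driver_id": 1001, "event_timestamp": datetime(2023, 5, 13, 10, 0), "conv_rate": 0.4})
source.push({"driver_id": 1001, "event_timestamp": datetime(2023, 5, 13, 11, 0), "conv_rate": 1.0})
# Online-only push: the serving snapshot changes, the offline history does not.
source.push(
    {"driver_id": 1002, "event_timestamp": datetime(2023, 5, 13, 11, 0), "conv_rate": 0.7},
    to_offline=False,
)
print(len(source.offline_rows), sorted(source.online_rows))
```
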
In [None]:
event_df = pd.DataFrame.from_dict(
    {
        "driver_id": [1001],
        "event_timestamp": [
            datetime(2023, 5, 13, 10, 59, 42),
        ],
        "created": [
            datetime(2023, 5, 13, 10, 59, 42),
        ],
        "conv_rate": [1.0],
        "acc_rate": [1.0],
        "avg_daily_trips": [1000],
    }
)

fs.push("driver_stats_push_source", event_df, to=PushMode.ONLINE_AND_OFFLINE)

Retrieve the data you pushed to the streaming source:

In [None]:
temp_entity = pd.DataFrame.from_dict({
        # entity's join key -> entity values
        "driver_id": [1001],
        # "event_timestamp" (reserved key) -> timestamps
        "event_timestamp": [
            datetime(2023, 5, 13, 10, 59, 42)
        ],
    }
)

# on demand transformations give us the last two features
# `conv_plus_trips` and `acc_plus_trips`
temp_df = fs.get_historical_features(
    entity_df=temp_entity,
    features=[
        "driver_hourly_stats:conv_rate",
        "driver_hourly_stats:acc_rate",
        "driver_hourly_stats:avg_daily_trips",
        "transformed_stats:conv_plus_trips",
        "transformed_stats:acc_plus_trips",
    ],
).to_df()

temp_df.head()

# Train the Model

Finally, you can use the training dataset you created before to train a simple linear regression model.

In [None]:
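target = "label_driver_reported_satisfaction"

# Use every column except the label and the timestamp as model inputs
train_X = training_df[training_df.columns.drop(target).drop("event_timestamp")]
train_Y = training_df.loc[:, target]

# Sort the columns so that the feature order is deterministic
reg = LinearRegression()
reg.fit(train_X[sorted(train_X)], train_Y)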