# Step 0: Install packages


In [26]:
!pip install feast==0.29.0
!pip install scikit-learn



In [27]:
!feast version

Feast SDK Version: "feast 0.29.0"


# Step 1: Exploring the data 
This is a set of time-series data with `driver_id` as the primary key (representing the driver entity) and `event_timestamp` indicating when each event happened.

In [None]:
import pandas as pd
pd.read_parquet("infra/driver_stats.parquet")

Unnamed: 0,event_timestamp,driver_id,conv_rate,acc_rate,created,day,miles_driven,daily_miles_driven
0,2022-02-27 15:00:00+00:00,1005,0.142214,0.408987,2022-03-14 15:21:25.842,2022-02-27,24.817618,150.587752
1,2022-02-27 16:00:00+00:00,1005,0.349267,0.734021,2022-03-14 15:21:25.842,2022-02-27,8.293352,150.587752
2,2022-02-27 17:00:00+00:00,1005,0.358805,0.366804,2022-03-14 15:21:25.842,2022-02-27,8.316672,150.587752
3,2022-02-27 18:00:00+00:00,1005,0.611828,0.773883,2022-03-14 15:21:25.842,2022-02-27,10.566458,150.587752
4,2022-02-27 19:00:00+00:00,1005,0.156503,0.966413,2022-03-14 15:21:25.842,2022-02-27,15.664332,150.587752
...,...,...,...,...,...,...,...,...
1802,2022-03-14 11:00:00+00:00,1001,0.667961,0.211051,2022-03-14 15:21:25.842,2022-03-14,40.500185,350.650257
1803,2022-03-14 12:00:00+00:00,1001,0.209861,0.672022,2022-03-14 15:21:25.842,2022-03-14,21.341177,350.650257
1804,2022-03-14 13:00:00+00:00,1001,0.215754,0.791849,2022-03-14 15:21:25.842,2022-03-14,8.068214,350.650257
1805,2022-03-14 14:00:00+00:00,1001,0.404588,0.407571,2022-03-14 15:21:25.842,2022-03-14,46.533292,350.650257
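Before registering data like this as a feature source, a couple of quick checks can be useful. The sketch below is only illustrative: it builds a toy frame from four rows copied from the output above (not the full parquet file) and checks that each `(driver_id, event_timestamp)` pair appears once and what time range each driver covers.

```python
import pandas as pd

# Toy frame: four rows copied from the output above, not the full file
df = pd.DataFrame({
    "event_timestamp": pd.to_datetime([
        "2022-02-27 15:00:00+00:00",
        "2022-02-27 16:00:00+00:00",
        "2022-03-14 13:00:00+00:00",
        "2022-03-14 14:00:00+00:00",
    ]),
    "driver_id": [1005, 1005, 1001, 1001],
    "conv_rate": [0.142214, 0.349267, 0.215754, 0.404588],
})

# Each (driver_id, event_timestamp) pair is expected to appear only once
has_duplicates = df.duplicated(subset=["driver_id", "event_timestamp"]).any()

# First event, last event, and row count per driver
coverage = df.groupby("driver_id")["event_timestamp"].agg(["min", "max", "count"])
print(has_duplicates)
print(coverage)
```

On the real data, the same two checks can be run directly on the frame returned by `pd.read_parquet`.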


# Step 1: Set up the feature repo
The first thing a platform team needs to do is set up a `feature_store.yaml` file within a version-controlled repo, for example on GitHub. `feature_store.yaml` is the primary way to configure an overall Feast project. We've set up a sample feature repository in `feature_repo_local/`.

## Step 1a: Run feast plan
With `feature_store.yaml` set up, you can now run `feast plan` to see what changes `feast apply` would make.



In [6]:
!cd /home/jovyan/feature_repo_local/ && feast plan

Created entity driver
Created feature view driver_hourly_stats
Created feature service model_v1
Created feature service model_v2

No changes to infrastructure


## Step 1b: Run feast apply
Now run `feast apply`.

This will parse the feature, data source, and feature service definitions and publish them to the registry. It may also set up tables in the online store to materialize batch features to (in this case, we set the online store to null, so no online store changes will occur).

In [8]:
!cd /home/jovyan/feature_repo_local/ && feast apply

Created entity driver
Created feature view driver_hourly_stats
Created feature service model_v1
Created feature service model_v2

Deploying infrastructure for driver_hourly_stats
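The definitions that `feast apply` parses live in Python files inside the feature repo. The contents of `feature_repo_local/` are not shown in this notebook, so the following is only a sketch of what a definitions file producing the objects listed above might look like. The source path, the `ttl` value, the source name, and the feature list of `model_v1` are assumptions; the object names, the `created` column, and the `float32` feature types are taken from the outputs in this notebook.

```python
from datetime import timedelta

from feast import Entity, FeatureService, FeatureView, Field, FileSource
from feast.types import Float32

# Entity keyed on the driver_id column
driver = Entity(name="driver", join_keys=["driver_id"])

# Assumption: placeholder path, replace with the real parquet location
driver_stats_source = FileSource(
    name="driver_stats_source",  # hypothetical name
    path="path/to/driver_stats.parquet",
    timestamp_field="event_timestamp",
    created_timestamp_column="created",
)

driver_hourly_stats = FeatureView(
    name="driver_hourly_stats",
    entities=[driver],
    ttl=timedelta(days=365),  # assumption: the real ttl is not shown
    schema=[
        Field(name="conv_rate", dtype=Float32),
        Field(name="acc_rate", dtype=Float32),
    ],
    source=driver_stats_source,
)

# Assumption: model_v1 uses a subset of the features
model_v1 = FeatureService(
    name="model_v1",
    features=[driver_hourly_stats[["conv_rate"]]],
)

# model_v2 returns conv_rate and acc_rate in the outputs of this notebook
model_v2 = FeatureService(
    name="model_v2",
    features=[driver_hourly_stats[["conv_rate", "acc_rate"]]],
)
```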


## Step 1c: Verify features are registered
You can now run Feast CLI commands to verify Feast knows about your features and data sources.



In [9]:
!cd /home/jovyan/feature_repo_local/ && feast feature-views list

NAME                 ENTITIES    TYPE
driver_hourly_stats  {'driver'}  FeatureView


# Step 2: ML engineers fetch features
ML engineers can use the defined FeatureService (corresponding to model versions) and schedule regular jobs that generate batch predictions (or regularly retrain).

## Step 0: Understanding `get_historical_features` and feature services
`get_historical_features` is an API for retrieving features, either by referencing features directly or via feature services. Under the hood, it performs point-in-time joins, which avoids data leakage when generating training datasets or powering batch scoring.
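To make the idea of a point-in-time join concrete, here is a minimal standard-library sketch. It is not Feast's implementation (for instance, it ignores the feature view `ttl` and the `created` column); it only shows the core rule that, for a given entity timestamp, the newest feature row at or before that timestamp is used and later rows are never seen. The helper names `latest_row_as_of` and `ts` are hypothetical, and the sample values are copied from the data shown in Step 1.

```python
from bisect import bisect_right
from datetime import datetime, timezone


def latest_row_as_of(feature_rows, driver_id, as_of):
    """Return the newest row for driver_id with event_timestamp <= as_of, else None."""
    rows = sorted(
        (r for r in feature_rows if r["driver_id"] == driver_id),
        key=lambda r: r["event_timestamp"],
    )
    timestamps = [r["event_timestamp"] for r in rows]
    idx = bisect_right(timestamps, as_of)  # number of rows at or before as_of
    return rows[idx - 1] if idx else None


def ts(hour, minute=0):
    """Hypothetical helper: a UTC timestamp on 2022-02-27."""
    return datetime(2022, 2, 27, hour, minute, tzinfo=timezone.utc)


feature_rows = [
    {"driver_id": 1005, "event_timestamp": ts(15), "conv_rate": 0.142214},
    {"driver_id": 1005, "event_timestamp": ts(16), "conv_rate": 0.349267},
    {"driver_id": 1005, "event_timestamp": ts(17), "conv_rate": 0.358805},
]

# An event at 16:30 sees the 16:00 row, never the 17:00 row
row = latest_row_as_of(feature_rows, 1005, ts(16, 30))
print(row["conv_rate"])  # 0.349267

# An event before any feature row exists gets no features
print(latest_row_as_of(feature_rows, 1005, ts(14)))  # None
```

Using the 17:00 row for the 16:30 event would leak information from the future into the training data, which is what the point-in-time rule prevents.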

For batch scoring, you want the latest feature values for your entities. Feast requires timestamps in `get_historical_features`, so you need to add an `event_timestamp` column set to the current time (`now()`). You don't need to run this code yet, since we'll run it in the next step.

In [16]:
import pandas as pd
from feast import FeatureStore

# Get the latest feature values for unique entities
entity_df = pd.DataFrame.from_dict({"driver_id": [1001, 1002, 1003, 1004, 1005],})
entity_df["event_timestamp"] = pd.to_datetime('now', utc=True)

# Connect to your feature store provider
store = FeatureStore(repo_path="/home/jovyan/feature_repo_local")

# Because we're using the default FileOfflineStore, this executes on your machine
training_df = store.get_historical_features(
    entity_df=entity_df, 
    features=store.get_feature_service("model_v2"),
).to_df()

print("----- Feature schema -----\n")
print(training_df.info())

print()
print("----- Example features -----\n")
print(training_df.head())

----- Feature schema -----

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype              
---  ------           --------------  -----              
 0   driver_id        5 non-null      int64              
 1   event_timestamp  5 non-null      datetime64[ns, UTC]
 2   conv_rate        5 non-null      float32            
 3   acc_rate         5 non-null      float32            
dtypes: datetime64[ns, UTC](1), float32(2), int64(1)
memory usage: 248.0 bytes
None

----- Example features -----

   driver_id                  event_timestamp  conv_rate  acc_rate
0       1002 2023-02-21 05:49:57.993076+00:00   0.465875  0.315721
1       1005 2023-02-21 05:49:57.993076+00:00   0.394072  0.046118
2       1003 2023-02-21 05:49:57.993076+00:00   0.869917  0.779562
3       1001 2023-02-21 05:49:57.993076+00:00   0.404588  0.407571
4       1004 2023-02-21 05:49:57.993076+00:00   0.977276  0.051582


## Step 1: Fetch features for the `driver_orders.csv` data to train the model



In [21]:
from feast import FeatureStore
from joblib import dump
import pandas as pd
from sklearn.linear_model import LinearRegression

# Load driver order data (note: when `orders` is passed as entity_df, it shows 0 entries)
orders = pd.read_csv("driver_orders.csv", sep="\t")
orders["event_timestamp"] = pd.to_datetime(orders["event_timestamp"])
print(orders)

            event_timestamp  driver_id  trip_completed
0 2021-04-16 20:29:28+00:00       1001               1
1 2021-04-17 04:29:28+00:00       1002               0
2 2021-04-17 12:29:28+00:00       1003               0
3 2021-04-17 20:29:28+00:00       1001               1
4 2021-04-18 04:29:28+00:00       1002               0
5 2021-04-18 12:29:28+00:00       1003               0
6 2021-04-18 20:29:28+00:00       1001               1
7 2021-04-19 04:29:28+00:00       1002               0
8 2021-04-19 12:29:28+00:00       1003               0
9 2021-04-19 20:29:28+00:00       1004               1


In [19]:
# Connect to your feature store provider
store = FeatureStore(repo_path="/home/jovyan/feature_repo_local")

# Because we're using the default FileOfflineStore, this executes on your machine
training_df = store.get_historical_features(
    entity_df=orders, 
    features=store.get_feature_service("model_v2"),
).to_df()

print("----- Feature schema -----\n")
print(training_df.info())

print()
print("----- Example features -----\n")
print(training_df.head())

----- Feature schema -----

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype              
---  ------           --------------  -----              
 0   event_timestamp  10 non-null     datetime64[ns, UTC]
 1   driver_id        10 non-null     int64              
 2   trip_completed   10 non-null     int64              
 3   conv_rate        10 non-null     float32            
 4   acc_rate         10 non-null     float32            
dtypes: datetime64[ns, UTC](1), float32(2), int64(2)
memory usage: 448.0 bytes
None

----- Example features -----

            event_timestamp  driver_id  trip_completed  conv_rate  acc_rate
0 2021-04-16 20:29:28+00:00       1001               1   0.521149  0.751659
1 2021-04-17 04:29:28+00:00       1002               0   0.089014  0.212637
2 2021-04-17 12:29:28+00:00       1003               0   0.188855  0.344736
3 2021-04-17 20:29:28+00:00       1001        

In [22]:
# Train model
target = "trip_completed"

reg = LinearRegression()
train_X = training_df[training_df.columns.drop(target).drop("event_timestamp")]
train_Y = training_df.loc[:, target]
reg.fit(train_X[sorted(train_X)], train_Y)

# Save model
dump(reg, "driver_model.bin")

['driver_model.bin']