# Implementing a Feature Store to predict fraudulent transactions


## Feature Store Explainer

### What is a feature store?


Before we dive into what a feature store is, a quick refresher: in machine learning, a feature is a piece of data used as input to a predictive model. It is the x in f(x) = y.

A feature store is an ML-specific system that:

- Transforms raw data into feature values for use by ML models (think of a data pipeline),
- Stores and manages this feature data, and
- Serves feature data consistently for training and inference purposes

#### What problem are feature stores trying to solve?

Feature stores are trying to solve three problems:

* **Training/serving skew.** When an ML model is trained on preprocessed data, the identical preprocessing steps must be applied to incoming prediction requests, because the model expects data with the same characteristics as the data it was trained on. If the two paths differ, the model's predictions get worse.
* **Duplicated feature engineering.** Many companies use the same features across a variety of models. A feature store acts as a central hub for those features, so teams avoid repeating the engineering setup or applying different pre-processing steps to the same feature.
* **Low-latency serving.** A feature store takes on the engineering burden of pre-loading features into low-latency storage, and ensures that those features were calculated the same way as the ones used for training.
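
To make the first problem concrete, here is a minimal sketch of the idea behind avoiding a difference between training and serving: both paths call one shared transformation. The function name `amount_feature` and the log scaling are illustrative assumptions, not part of this tutorial's code.

```python
import math


def amount_feature(raw_amount: float) -> float:
    """Single shared transformation: log-scale a transaction amount."""
    return math.log1p(raw_amount)


# Training path: transform historical rows.
training_amounts = [12.0, 250.0, 1800.0]
training_features = [amount_feature(a) for a in training_amounts]

# Serving path: transform an incoming request with the very same function.
request_amount = 250.0
serving_feature = amount_feature(request_amount)

# Because both paths share one definition, the values agree exactly.
print(serving_feature == training_features[1])
```

A feature store generalises this: the transformation is defined once, and both the training dataset and the online lookup read its results.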

### When to use a feature store



In most cases feature stores add unnecessary complexity; they are well suited only to specific ML use cases. You might even be asking, "If a feature store is simply making sure the same pre-processing happens on the data, why can't I do that transformation on the raw data during inference?"

There are two scenarios in which that isn't viable:


*   The first situation is when the feature value is not known by the client requesting the prediction and has to be computed on the server instead. In that case we need a mechanism to inject the feature values into incoming prediction requests, and the feature store plays that role. For example, one of the features of a dynamic pricing model may be the number of website visitors to the item listing over the past hour. The client (think of a mobile app) requesting the price of a hotel will not know this feature's value. It has to be computed on the server using a streaming pipeline on clickstream data and inserted into the feature store. Fetching or aggregating a lot of raw data inside the request itself would also be too slow.

* The second situation is to prevent unnecessary copies of the data. For example, consider that you have a feature that is computationally expensive and is used in multiple ML models. Rather than using a transform function and storing the transformed feature in multiple ML training datasets, it is much more efficient and maintainable to store it in a centralized repository.
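
The first scenario can be sketched in a few lines of plain Python. This is a toy illustration, not Feast: the class `InMemoryFeatureStore`, the function `handle_price_request` and the listing id are all hypothetical, and the one-hour window is left out for brevity.

```python
from collections import defaultdict


class InMemoryFeatureStore:
    """Toy stand-in for an online feature store keyed by entity id."""

    def __init__(self):
        self._visits = defaultdict(int)

    def record_visit(self, listing_id: str) -> None:
        # In a real system a streaming pipeline would maintain this count.
        self._visits[listing_id] += 1

    def get_features(self, listing_id: str) -> dict:
        return {"visits_last_hour": self._visits[listing_id]}


def handle_price_request(store: InMemoryFeatureStore, request: dict) -> dict:
    """Enrich a client request with features only the server knows."""
    features = store.get_features(request["listing_id"])
    return {**request, **features}


store = InMemoryFeatureStore()
for _ in range(3):
    store.record_visit("hotel-42")

enriched = handle_price_request(store, {"listing_id": "hotel-42", "nights": 2})
print(enriched)
```

The client only sends what it knows (`listing_id`, `nights`); the visitor count is injected on the server before the model sees the request.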

To summarize, a feature store is most valuable when:

* A feature is unknown to the client and needs to be fetched or computed server-side
* A feature requires intensive computation
* A feature is used by many different models

Okay, that's enough English for now. Let's get building, since I prefer to speak in code.

## Tutorial





Throughout this tutorial, we’ll walk through the creation of a production-ready fraud prediction system, end to end. We will be predicting whether a transaction made by a given user will be fraudulent. This prediction will be made in real-time as the user makes the transaction, so we need to be able to generate a prediction at low latency.

Our system will perform the following workflows:
- Computing and backfilling feature data from raw data
- Building point-in-time correct training datasets from feature data and training a model
- Making online predictions from feature data

We will be using an open-source framework called Feast, whose development is backed by Tecton, one of the leading feature-store companies globally. Tecton itself is a managed feature platform that comes with a few more beneficial features such as monitoring. We will then be deploying our application to AWS.

If you don't have it, download the data required for this tutorial from [here](https://drive.google.com/file/d/1MidRYkLdAV-i0qytvsflIcKitK4atiAd/view?usp=sharing). This is originally from a [Kaggle dataset](https://www.kaggle.com/competitions/ieee-fraud-detection/data) for Fraud Detection. Place this dataset in a `data` directory in the root of your project. You can run this notebook either in VS Code or Jupyter Notebooks.

We're going to convert this dataset into a format that Feast can understand: a Parquet file. We also need to add two columns, `event_timestamp` and `created_timestamp`, so that Feast can index the data by time.

In [45]:
import pandas as pd
from datetime import datetime

df = pd.read_csv('data/train_transaction.csv')

# TransactionDT is a relative time offset; rescale it so the largest value is 1
df["TransactionDT"] = df["TransactionDT"] / df["TransactionDT"].max()

# Spread the transactions across the calendar year 2021
start = datetime(2021, 1, 1).timestamp()
end = datetime(2022, 1, 1).timestamp()

df["event_timestamp"] = pd.to_datetime(df["TransactionDT"].apply(lambda x: round(start + x * (end - start))), unit='s')
df["created_timestamp"] = df["event_timestamp"].copy()

# Keep only the columns used in this tutorial, then write the Parquet file
df = df[["ProductCD", "TransactionAmt", "P_emaildomain", "R_emaildomain", "card4", "M1", "M2", "M3", "created_timestamp", "event_timestamp", "isFraud"]]
df.to_parquet('data/train_transaction.parquet')
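
If the timestamp arithmetic above feels opaque, the following standard-library sketch shows the same linear mapping on three made-up offsets. The helper `to_event_timestamp` and the offsets are hypothetical, and the sketch pins everything to UTC, whereas the cell above uses naive local datetimes.

```python
from datetime import datetime, timezone


def to_event_timestamp(offset, max_offset, start, end):
    """Map a relative offset in [0, max_offset] linearly onto [start, end]."""
    fraction = offset / max_offset
    span = end.timestamp() - start.timestamp()
    seconds = round(start.timestamp() + fraction * span)
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


start = datetime(2021, 1, 1, tzinfo=timezone.utc)
end = datetime(2022, 1, 1, tzinfo=timezone.utc)

offsets = [0, 500, 1000]  # hypothetical TransactionDT values
stamps = [to_event_timestamp(o, max(offsets), start, end) for o in offsets]

for offset, stamp in zip(offsets, stamps):
    print(offset, stamp.isoformat())
```

The smallest offset lands on the start of the window, the largest on its end, and everything else is spread proportionally in between.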

### Setup AWS Infrastructure


Since infrastructure and architecture are not the purpose of this tutorial, we will use [Terraform](https://www.terraform.io) to quickly set up our infrastructure in AWS and continue with the rest of the tutorial.

Without deviating too much, let me quickly explain what Terraform is and the different components we set up:


*   Terraform is an infrastructure-as-code tool that allows you to create and change infrastructure predictably. In plain English, think of it as a setup definition file: with one command you can create development and production environments that are exact replicas of each other.

The following is created from the terraform file:

*   **S3 bucket** - this is where we store the data files used in this tutorial
* **Redshift cluster** - this is the AWS data warehouse we will be using
* **AWS Glue** - this is the AWS ELT tool that we will use to get our data from S3 to Redshift.
* **AWS IAM Roles** - we create the roles that are needed for these three resources to interact.

Okay, enough geeking out on Terraform - let's code!


We need to set up our AWS credentials in order to deploy this Terraform setup to our account. To start, make sure you have your AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables set. If not, go to your AWS console and follow the instructions below:

*   Go to the IAM service
*   Click "*Users*" in the sidebar
*   Go through the steps to create a user and attach the policies listed below.

If you already have a user, make sure you have the following permissions:

*   AmazonRedshiftDataFullAccess
*   AmazonS3FullAccess
*   AWSGlueConsoleFullAccess
*   IAMFullAccess

Once the user is created, click on it and go to the "*Security Credentials*" tab. Scroll down and click the "Create access key" button. You should then see an *Access Key* and a *Secret Key* generated for you.

Run the code below, pasting in the generated keys:


In [1]:
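# Paste your own keys here and never commit real credentials to a notebook.
# %env is used because `!export` runs in a throwaway subshell and would not persist.
# (If you work in a terminal instead, run `export NAME=value` there.)
%env AWS_ACCESS_KEY_ID=<YOUR_ACCESS_KEY_ID>
%env AWS_SECRET_ACCESS_KEY=<YOUR_SECRET_ACCESS_KEY>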
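
Before moving on, it can help to confirm that both variables are visible to the process you are working in. The helper below is a small sketch of such a check; `missing_aws_vars` is a hypothetical name, and the check only tests that the variables are non-empty, not that the keys are valid.

```python
import os

REQUIRED_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def missing_aws_vars(environ=os.environ):
    """Return the names of required credential variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not environ.get(name)]


missing = missing_aws_vars()
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("AWS credentials found in the environment.")
```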

Install the Terraform framework. We use Homebrew on macOS but you may install it however you like.

In [2]:
!brew install terraform

Running `brew update --auto-update`...
==> Auto-updated Homebrew!
Updated 3 taps (homebrew/core, homebrew/cask and homebrew/cask-fonts).
==> New Formulae
create-api      ksh93           prr             pymupdf         snapcast
==> New Casks
v2ray-unofficial

You have 2 outdated formulae installed.
You can upgrade them with brew upgrade
or list them with brew outdated.

==> Downloading https://ghcr.io/v2/homebrew/core/terraform/manifests/1.2.6
######################################################################## 100.0%
==> Downloading https://ghcr.io/v2/homebrew/core/terraform/blobs/sha256:e540a9b5
==> Downloading from https://pkg-containers.githubusercontent.com/ghcr1/blobs/sh
######################################################################## 100.0%
==> Pouring terraform--1.2.6.arm64_monterey.bottle.tar.gz
🍺  /opt/homebrew/Cellar/terraf

In your terminal, go to the "infra" folder that came along with this tutorial and run the commands below. They are written as plain shell commands because an `export` issued through a notebook `!` line would not carry over to the next line.

In [2]:
# Run the following in the infra folder
terraform init
export TF_VAR_region="us-west-2"  # region to deploy to
export TF_VAR_project_name="fraud-classifier"  # set a project name
terraform apply -var="admin_password=thisISTestPassword1"

Once your infrastructure is deployed you should see the following output in your terminal. Save these, we will need them later.

In [None]:
redshift_cluster_identifier = "my-feast-project-aws-cerebrium-redshift-cluster"
redshift_spectrum_arn = "arn:aws:iam::ACCOUNT_NUMBER:role/s3_spectrum_role"
credit_history_table = "credit_history"
zipcode_features_table = "zipcode_features"

In [3]:
#run the following in your terminal
aws redshift-data execute-statement \
--region us-west-2 \
--cluster-identifier [SET YOUR redshift_cluster_identifier HERE] \
--db-user admin \
--database dev \
--sql "create external schema spectrum from data catalog database 'dev' iam_role '[SET YOUR redshift_spectrum_arn here]' create external database if not exists;"



You should then get a JSON result back. Pass the id it contains to the command below:

In [None]:
aws redshift-data describe-statement --id [SET YOUR STATEMENT ID HERE] --region us-west-2
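
Rather than re-running `describe-statement` by hand, you could poll until the statement finishes. The sketch below assumes the response carries a `Status` field that ends in `FINISHED`, `FAILED` or `ABORTED`, as the Redshift Data API does; `wait_for_statement` and `fake_describe` are hypothetical helpers, and the fake client exists only so the example runs without AWS access. With boto3 you would pass `boto3.client("redshift-data", region_name="us-west-2").describe_statement` instead.

```python
import time

TERMINAL_STATES = {"FINISHED", "FAILED", "ABORTED"}


def wait_for_statement(describe, statement_id, poll_seconds=1.0, max_polls=60):
    """Call describe(Id=...) repeatedly until the statement reaches a terminal status."""
    for _ in range(max_polls):
        response = describe(Id=statement_id)
        if response["Status"] in TERMINAL_STATES:
            return response
        time.sleep(poll_seconds)
    raise TimeoutError(f"Statement {statement_id} did not finish in time")


# Stand-in for the real client: reports progress, then success.
_fake_statuses = iter(["SUBMITTED", "STARTED", "FINISHED"])


def fake_describe(Id):
    return {"Id": Id, "Status": next(_fake_statuses)}


result = wait_for_statement(fake_describe, "example-id", poll_seconds=0)
print(result["Status"])
```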


If that is all running successfully then we are done with our AWS setup! 

### Feast

To get started, let us install the Feast framework. Feast can be installed using pip. Run this command in your command line instead of running it in the notebook.


In [1]:
pip install feast


Collecting feast
  Downloading feast-0.23.0.tar.gz (3.0 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
Collecting tenacity<9,>=7
  Downloading tenacity-8.0.1-py3-none-any.whl (24 kB)
Collecting proto-plus<2,>=1.20.0
  Downloading proto_plus-1.20.6-py3-none-any.whl (46 kB)
Collecting google-api-core<3,>=1.23.0
  Downloading google_api_core-2.8.2-py3-none-any.whl (114 kB)
Collecti

In Feast, you define your features in Python files and configure the feature store with a .yaml file, all inside a feature repository. To create a repository, run the command below. This will create a few files that are mostly examples (you can delete driver_repo.py and test.py if you would like); we only care about:

*   **example.py**: This is a Python file where you define your features and where Feast will find them, e.g. a Redshift cluster or an S3 bucket.
*   **feature_store.yaml**: This is a configuration file where you define the location of your Redshift cluster, S3 bucket and DynamoDB database.


Since we are using AWS, we pass *aws* as the template in the command; if you were using Google Cloud you would use *gcp*. Run the command in your terminal rather than in this notebook, otherwise the folders will be created here. Create a folder, change into it in your command line, and run the command below:

In [2]:
!feast init -t aws feature_repo # Command only shown for reference.

AWS Region (e.g. us-west-2): us-west-2
Redshift Cluster ID: my-feast-project-aws-cerebrium-redshift-cluster
Redshift Database Name: dev
Redshift User Name: admin
Redshift S3 Staging Location (s3://*): 
Aborted!


Create a file called example.py in which we will define our features. Before we get started, we need to understand the concept of an Entity and FeatureView:

*   **Entity**: An entity is a collection of semantically related features. For example, Uber would have customers and drivers as two separate entities, each grouping the features that correspond to it.
*   **FeatureView**: A feature view is an object that represents a logical group of time-series feature data as it is found in a data source. They consist of zero or more entities, one or more features and a data source.



In [None]:
from datetime import timedelta
from feast import (Entity, Feature, FeatureView, RedshiftSource,
                   ValueType)

# The entity name must match the name listed in the FeatureView's `entities` below
transaction = Entity(name="transaction", value_type=ValueType.STRING)

transaction_source = RedshiftSource(
    query="SELECT * FROM spectrum.transaction_features",
    event_timestamp_column="event_timestamp",
    created_timestamp_column="created_timestamp",
)

transaction_features = FeatureView(
    name="transaction_features",
    entities=["transaction"],
    ttl=timedelta(days=365),
    features=[
        Feature(name="ProductCD", dtype=ValueType.STRING),
        Feature(name="P_emaildomain", dtype=ValueType.STRING),
        Feature(name="R_emaildomain", dtype=ValueType.STRING),
        Feature(name="card4", dtype=ValueType.STRING),
        Feature(name="M1", dtype=ValueType.STRING),
        Feature(name="M2", dtype=ValueType.STRING),
        Feature(name="M3", dtype=ValueType.STRING)
    ],
    batch_source=transaction_source,
)

First we create our *transaction* entity and define the SQL that will fetch the required features from our Redshift data warehouse. We then create a FeatureView that uses the Redshift source to fetch the features, and define the data type of each feature. We also set a time-to-live (`ttl`), which bounds how far back Feast looks for feature values. In this case we want one year's worth of data, which is 365 days.

Next, we'll edit the `feature_store.yaml` file to reference our Redshift cluster and S3 bucket - these are the values returned when Terraform finished running. Below is what the various fields mean:

* **project**: The name you would like to call the project.
* **registry**: The registry is a central catalog of all the feature definitions and their related metadata. It is a file that you can interact with through the Feast API
*   **provider**: The cloud provider you are using - in our case AWS
*   **online_store**: The online store is used for low-latency feature lookups at prediction time. Feature values are loaded into it from your data sources, and it only holds the latest values per entity key. Typical online stores are Redis or DynamoDB.
* **offline_store**: The offline store holds historical feature values; it does not generate them. It is used as the interface for querying existing features, for example to build training datasets, or for loading features into an online store. An offline store would be something like a data warehouse or storage bucket: higher latency, but a lot of historical data.


Your S3 bucket name you fill in below should be something along the lines of s3://my-feast-project-aws-cerebrium-bucket where cerebrium should be replaced with your project name.

In [None]:
project: credit_scoring_aws
registry: registry.db
provider: aws
online_store:
    type: dynamodb
    region: us-west-2
offline_store:
    type: redshift
    cluster_id: [SET YOUR CLUSTER ID]
    region: us-west-2
    database: dev
    user: admin
    s3_staging_location: s3://[SET YOUR BUCKET NAME]
    iam_role: [SET YOUR ARN]


Deploy the feature store by running `feast apply` from within the feature_repo/ folder.

In [None]:
!feast apply

If everything was created correctly, you should see output like the following:

In [None]:
#Created entity zipcode
#Created entity dob_ssn
#Created feature view credit_history
#Created feature view zipcode_features

#Deploying infrastructure for credit_history
#Deploying infrastructure for zipcode_features

Next we load our features into the online store using the materialize-incremental command. This command loads the latest feature values that have arrived in the data source since the last materialize call, up to the timestamp you pass in, into the online store.

In [None]:
CURRENT_TIME=$(date -u +"%Y-%m-%dT%H:%M:%S")
feast materialize-incremental $CURRENT_TIME

If successful, you should see some activity in your terminal showing that it is uploading the features. Once completed, you should see the results in DynamoDB on AWS.
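
Conceptually, incremental materialization copies the newest value per entity from a time window into the online store and remembers where it stopped. The following is a plain-Python sketch of that idea under simplified assumptions (an in-memory list as the offline store, a dict as the online store); it is not Feast's implementation, and `materialize_incremental` here is a hypothetical function.

```python
from datetime import datetime


def materialize_incremental(offline_rows, online_store, last_run, end):
    """Copy the newest value per entity from the window (last_run, end] into online_store."""
    latest_ts = {}
    for row in offline_rows:
        ts = row["event_timestamp"]
        if not (last_run < ts <= end):
            continue  # outside the window for this run
        key = row["entity_id"]
        if key not in latest_ts or ts > latest_ts[key]:
            latest_ts[key] = ts
            online_store[key] = row["features"]
    return end  # becomes `last_run` for the next call


rows = [
    {"entity_id": "t1", "event_timestamp": datetime(2021, 3, 1), "features": {"card4": "visa"}},
    {"entity_id": "t1", "event_timestamp": datetime(2021, 6, 1), "features": {"card4": "mastercard"}},
    {"entity_id": "t2", "event_timestamp": datetime(2021, 9, 1), "features": {"card4": "visa"}},
]

online = {}
checkpoint = materialize_incremental(rows, online, datetime(2021, 1, 1), datetime(2021, 7, 1))
first_run = dict(online)

checkpoint = materialize_incremental(rows, online, checkpoint, datetime(2022, 1, 1))
print(first_run)
print(online)
```

After the first run only `t1` is online, holding its newest value; the second run picks up `t2` without re-reading the rows that were already processed.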

### Model Building

In the repo, we have two files with respect to our model:

*   *run.py*: This is a helper file that goes through the full model workflow. It fetches the historical transaction data, trains our model and then makes a prediction to determine whether a transaction is fraudulent.
*   *fraud_detection_model.py*: This file shows how we use Feast during model building as well as during inference.

We will go through the code below

In [None]:
#run.py
import boto3
import pandas as pd

#Import Class that contains most of the functionality
from fraud_detection_model import FraudClassifierModel

# Get historic transaction data
transactions = pd.read_parquet("data/train_transaction.parquet")

# Create model (instantiate the class rather than referencing it)
model = FraudClassifierModel()

# Train model (using Redshift for the historical transaction features)
if not model.is_model_trained():
    model.train(transactions)

# Make online prediction (using DynamoDB for retrieving online features)
transaction_request = {
    "ProductCD": ["W"],
    "P_emaildomain": ["live"],
    "R_emaildomain": [None],
    "card4": ["visa"],
    "M1": ["T"],
    "M2": ["T"],
    "M3": ["T"]
}

result = model.predict(transaction_request)

if result == 0:
    print("Non-Fraudulent Transaction")
elif result == 1:
    print("Fraudulent Transaction")

We won't go through the entire fraud_detection_model.py file, just a few snippets from it.

We start by defining our model features, which we do by specifying `[feature view name]:[feature name]`.

In [None]:
#line 21
feast_features = [
        "transaction_features:ProductCD",
        "transaction_features:isFraud",
        "transaction_features:P_emaildomain",
        "transaction_features:R_emaildomain",
        "transaction_features:card4",
        "transaction_features:M1",
        "transaction_features:M2",
        "transaction_features:M3"
    ]
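
Each entry in the list above is a single string with a colon separating the two parts. As a small illustration of that format, the sketch below splits such references and groups the feature names by their prefix; `split_feature_ref` is a hypothetical helper written for this example, not a Feast function.

```python
def split_feature_ref(ref: str):
    """Split 'view:feature' into its two parts, rejecting malformed references."""
    view, sep, feature = ref.partition(":")
    if not sep or not view or not feature:
        raise ValueError(f"Expected 'view:feature', got {ref!r}")
    return view, feature


refs = ["transaction_features:ProductCD", "transaction_features:card4"]

by_view = {}
for ref in refs:
    view, feature = split_feature_ref(ref)
    by_view.setdefault(view, []).append(feature)

print(by_view)
```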

During the initialisation of our model we attach the feature store to our model object to use later. The repo path is the folder that contains the feature_store.yaml and example.py we created above; Feast fetches its configuration from there.

In [None]:
#57
self.fs = feast.FeatureStore(repo_path="feature_repo")

When we would like to train our model, we want to get the historical data relating to our features. The method below launches a job that executes a join of features from the offline store onto the entity dataframe. 

An entity dataframe is the target dataframe onto which you would like to join feature values. It must contain a timestamp column called event_timestamp and all entities (primary keys) necessary to join feature tables onto it. All entities found in the feature views being joined must be present as columns on the entity dataframe.

Once completed, a job reference will be returned. This job reference can then be converted to a Pandas dataframe by calling to_df().

In [None]:
#line 66
training_df = self.fs.get_historical_features(
            entity_df=loans, features=self.feast_features
        ).to_df()
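
To build intuition for what a point-in-time join does, here is a plain-Python sketch: for each entity row it picks the newest feature row that is not later than the entity's timestamp and not older than the time-to-live. It is a simplified illustration rather than Feast's implementation; `point_in_time_lookup`, the sample rows and the inclusive window boundaries are assumptions.

```python
from datetime import datetime, timedelta


def point_in_time_lookup(feature_rows, entity_id, as_of, ttl):
    """Return the newest feature row for entity_id with as_of - ttl <= timestamp <= as_of."""
    candidates = [
        row for row in feature_rows
        if row["entity_id"] == entity_id
        and as_of - ttl <= row["event_timestamp"] <= as_of
    ]
    if not candidates:
        return None  # no feature value was known at that time
    return max(candidates, key=lambda row: row["event_timestamp"])


feature_rows = [
    {"entity_id": "t1", "event_timestamp": datetime(2021, 1, 10), "card4": "visa"},
    {"entity_id": "t1", "event_timestamp": datetime(2021, 5, 10), "card4": "mastercard"},
]

# Entity dataframe stand-in: (entity id, event_timestamp) pairs.
entity_rows = [
    ("t1", datetime(2021, 3, 1)),
    ("t1", datetime(2021, 6, 1)),
    ("t1", datetime(2020, 12, 1)),
]

joined = [
    point_in_time_lookup(feature_rows, entity_id, ts, timedelta(days=365))
    for entity_id, ts in entity_rows
]
print([row["card4"] if row else None for row in joined])
```

The March row sees only the January value, the June row sees the May value, and the December 2020 row gets nothing because no feature existed yet. This is what keeps future information from leaking into training data.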

When we do online inference (prediction) using our model, we don't want to have to fetch all the historical data or anything really from our data warehouse since that will take multiple seconds. Rather we want to get the data we need from a low-latency data-source so we can have a low response time (~100ms). We do that below with the get_online_features function.

In [None]:
#line 123
return self.fs.get_online_features(
    entity_rows=[{"transaction": transaction}],
    features=self.feast_features,
).to_dict()

The above allows us to pass in a specific transaction and get its feature values almost instantaneously. We can then use these values in our *predict* function to return the model's prediction for that transaction.
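
The dictionary that comes back still has to be turned into model input. The sketch below assumes the result of `to_dict()` maps each name to a list with one value per requested entity row; the sample `response` is made up for illustration, and `online_response_to_rows` is a hypothetical helper.

```python
def online_response_to_rows(response: dict) -> list:
    """Turn {"name": [v1, v2, ...]} into [{"name": v1}, {"name": v2}, ...]."""
    names = list(response)
    columns = (response[name] for name in names)
    return [dict(zip(names, values)) for values in zip(*columns)]


# Hypothetical response for a single requested transaction.
response = {"transaction": ["t1"], "ProductCD": ["W"], "card4": ["visa"]}

rows = online_response_to_rows(response)
print(rows)
```

Each resulting row can then be encoded the same way as the training data before it is handed to the model.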

Now let us run our run.py file to see this live and the output of our model

In [2]:
python run.py

#Output

#loan rejected!



## Conclusion

That's it for our tutorial on feature stores! As I am sure you can tell, feature stores can add a lot of value to your ML infrastructure when it comes to using the same features across multiple models or doing server-side feature calculations; however, they can also add some complexity. Feast is a great way to implement one, but if you want a more managed approach with extra functionality such as identifying model drift, you can try Tecton or the feature stores that are native to the AWS and Google platforms.