### 1. Prepare data set and store in parquet format

In [1]:
import pandas as pd

In [2]:
data = pd.read_csv('https://raw.githubusercontent.com/TripathiAshutosh/feast/main/Feast%20Live%20Demo/diabetes.csv')

In [3]:
data.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [4]:
predictors_df = data.loc[:,data.columns!='Outcome']
target_df = data['Outcome']

In [6]:
predictors_df.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
0,6,148,72,35,0,33.6,0.627,50
1,1,85,66,29,0,26.6,0.351,31
2,8,183,64,0,0,23.3,0.672,32
3,1,89,66,23,94,28.1,0.167,21
4,0,137,40,35,168,43.1,2.288,33


**Create timestamps to be added as the event_timestamp column in the data set.**

In [7]:
timestamps = pd.date_range(end = pd.Timestamp.now(),
                           periods = len(data),freq = 'D').to_frame(name = 'event_timestamp', index = False)
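To make the effect of `pd.date_range(end=..., periods=..., freq='D')` concrete, here is a minimal standard-library sketch of the same idea: one timestamp per row, spaced one day apart, with the last one equal to the chosen end time. The function name `daily_timestamps` and the fixed end time are illustrative assumptions, not part of the notebook.

```python
from datetime import datetime, timedelta

def daily_timestamps(end, periods):
    """Return `periods` timestamps, one day apart, the last one equal to `end`."""
    return [end - timedelta(days=periods - 1 - i) for i in range(periods)]

# Fixed end time (instead of "now") so the example is reproducible
end = datetime(2022, 6, 27, 22, 6, 39)
stamps = daily_timestamps(end, periods=768)
print(stamps[0], stamps[-1])
```

With 768 rows, the first timestamp falls 767 days before the end time, which is why the data set appears to span roughly two years.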

In [8]:
timestamps

Unnamed: 0,event_timestamp
0,2020-05-21 22:06:39.689804
1,2020-05-22 22:06:39.689804
2,2020-05-23 22:06:39.689804
3,2020-05-24 22:06:39.689804
4,2020-05-25 22:06:39.689804
...,...
763,2022-06-23 22:06:39.689804
764,2022-06-24 22:06:39.689804
765,2022-06-25 22:06:39.689804
766,2022-06-26 22:06:39.689804


**Add the event_timestamp column to the predictors and target dataframes.**

In [9]:
predictors_df = pd.concat(objs = [predictors_df, timestamps], axis = 1)
target_df = pd.concat(objs = [target_df, timestamps], axis =1)

In [10]:
predictors_df.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,event_timestamp
0,6,148,72,35,0,33.6,0.627,50,2020-05-21 22:06:39.689804
1,1,85,66,29,0,26.6,0.351,31,2020-05-22 22:06:39.689804
2,8,183,64,0,0,23.3,0.672,32,2020-05-23 22:06:39.689804
3,1,89,66,23,94,28.1,0.167,21,2020-05-24 22:06:39.689804
4,0,137,40,35,168,43.1,2.288,33,2020-05-25 22:06:39.689804


In [11]:
target_df.head()

Unnamed: 0,Outcome,event_timestamp
0,1,2020-05-21 22:06:39.689804
1,0,2020-05-22 22:06:39.689804
2,1,2020-05-23 22:06:39.689804
3,0,2020-05-24 22:06:39.689804
4,1,2020-05-25 22:06:39.689804


**Create a patient_id column so that patient_id and event_timestamp together uniquely identify each record.**

In [12]:
dataLen = len(data)
idsList = list(range(dataLen))

In [3]:
#idsList

In [14]:
patient_ids = pd.DataFrame(data = idsList, columns = ['patient_id'])

In [4]:
#patient_ids

In [16]:
predictors_df = pd.concat(objs = [predictors_df, patient_ids], axis = 1)
target_df = pd.concat(objs = [target_df, patient_ids], axis =1)
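The point of the two extra columns is that each record gets a unique (patient_id, event_timestamp) key. A quick way to sanity-check that property is sketched below in plain Python; the helper name `find_duplicate_keys` and the five sample rows are illustrative, and in the notebook the same check would run over the rows of `predictors_df`.

```python
from datetime import datetime, timedelta

def find_duplicate_keys(rows, key_fields=("patient_id", "event_timestamp")):
    """Return the set of key tuples that occur more than once in `rows`."""
    seen, duplicates = set(), set()
    for row in rows:
        key = tuple(row[field] for field in key_fields)
        if key in seen:
            duplicates.add(key)
        seen.add(key)
    return duplicates

start = datetime(2020, 5, 21, 22, 6, 39)
rows = [{"patient_id": i, "event_timestamp": start + timedelta(days=i)} for i in range(5)]
print(find_duplicate_keys(rows))  # an empty set means every key is unique
```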

In [17]:
predictors_df.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,event_timestamp,patient_id
0,6,148,72,35,0,33.6,0.627,50,2020-05-21 22:06:39.689804,0
1,1,85,66,29,0,26.6,0.351,31,2020-05-22 22:06:39.689804,1
2,8,183,64,0,0,23.3,0.672,32,2020-05-23 22:06:39.689804,2
3,1,89,66,23,94,28.1,0.167,21,2020-05-24 22:06:39.689804,3
4,0,137,40,35,168,43.1,2.288,33,2020-05-25 22:06:39.689804,4


In [23]:
predictors_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 10 columns):
 #   Column                    Non-Null Count  Dtype         
---  ------                    --------------  -----         
 0   Pregnancies               768 non-null    int64         
 1   Glucose                   768 non-null    int64         
 2   BloodPressure             768 non-null    int64         
 3   SkinThickness             768 non-null    int64         
 4   Insulin                   768 non-null    int64         
 5   BMI                       768 non-null    float64       
 6   DiabetesPedigreeFunction  768 non-null    float64       
 7   Age                       768 non-null    int64         
 8   event_timestamp           768 non-null    datetime64[ns]
 9   patient_id                768 non-null    int64         
dtypes: datetime64[ns](1), float64(2), int64(7)
memory usage: 60.1 KB


In [18]:
predictors_df.to_parquet(path='predictors_df.parquet')
target_df.to_parquet(path='target_df.parquet')

In [None]:
#!pip install feast

In [19]:
!feast version

Feast SDK Version: "feast 0.21.3"


### 2. Do feast init

This step is optional, since all it does is create the Feast repo directory structure. You could instead create a directory with mkdir and add a feature_store.yaml file and a feature_definitions.py file inside it, but it is easier to run feast init and then modify the generated files.
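The manual alternative could look roughly like the sketch below, which builds the directory and the two empty files with `pathlib`. The helper name `create_repo_skeleton` and the `data` subfolder are assumptions (the later cells read parquet files from `data/`); the example runs in a temporary directory so it does not touch your working folder.

```python
from pathlib import Path
import tempfile

def create_repo_skeleton(root):
    """Create an empty feature repo layout by hand instead of running `feast init`."""
    repo = Path(root)
    # Assumed location for the parquet files and the local stores
    (repo / "data").mkdir(parents=True, exist_ok=True)
    (repo / "feature_store.yaml").touch()
    (repo / "feature_definitions.py").touch()
    return repo

with tempfile.TemporaryDirectory() as tmp:
    repo = create_repo_skeleton(Path(tmp) / "feature_repo")
    print(sorted(p.name for p in repo.iterdir()))
```

Both files start out empty here, so they still need the configuration and the feature definitions described in the next steps.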

In [21]:
!feast init feature_repo


Creating a new Feast repository in C:\Users\Ashutosh Tripathi\Documents\feast\Feast Live Demo\feature_repo.





### 3. Update feature store yaml file if needed

You can update the online store and local store paths in the feature_store.yaml file if needed.
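For orientation, a local setup usually needs only a project name, a registry path, a provider, and an online store path in `feature_store.yaml`. The values below are assumptions modeled on the local template that `feast init` generated around Feast 0.21, so compare them with the file in your own repo before relying on them. The snippet only writes that text to a file; the helper name `write_feature_store_yaml` is illustrative.

```python
from pathlib import Path
import tempfile

# Assumed contents for a local setup; adjust the project name and paths to your repo
FEATURE_STORE_YAML = """\
project: feature_repo
registry: data/registry.db
provider: local
online_store:
    path: data/online_store.db
"""

def write_feature_store_yaml(repo_dir):
    """Write the assumed configuration into <repo_dir>/feature_store.yaml."""
    target = Path(repo_dir) / "feature_store.yaml"
    target.write_text(FEATURE_STORE_YAML)
    return target

with tempfile.TemporaryDirectory() as tmp:
    print(write_feature_store_yaml(tmp).read_text())
```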

### 4. Define feature definitions in a Python file inside the feature repo directory (created using feast init)
This step is known as registering and deploying the features.
In the feature_repo folder of the GitHub repository you will find the feature_definition.py file with the updated code. Modify it to match the features of your dataset.

### 5. Do feast apply

Run feast apply from inside the feature_repo directory.

In [24]:
pwd

'C:\\Users\\Ashutosh Tripathi\\Documents\\feast\\Feast Live Demo'

In [25]:
cd feature_repo

C:\Users\Ashutosh Tripathi\Documents\feast\Feast Live Demo\feature_repo


In [28]:
!feast apply

Created entity patient_id
Created feature view predictors_df_feature_view
Created feature view target_df_feature_view

Created sqlite table feature_repo_predictors_df_feature_view
Created sqlite table feature_repo_target_df_feature_view





### 6. Generate Training Data Set

In [35]:
from feast import FeatureStore
from feast.infra.offline_stores.file_source import SavedDatasetFileStorage

store = FeatureStore(repo_path='.')

# Assumes the parquet files written in step 1 were copied into feature_repo/data/
entity_df = pd.read_parquet(path = 'data/target_df.parquet')

training_data = store.get_historical_features(
    entity_df = entity_df,
    features = [
        "predictors_df_feature_view:Pregnancies",
        "predictors_df_feature_view:Glucose",
        "predictors_df_feature_view:BloodPressure",
        "predictors_df_feature_view:SkinThickness",
        "predictors_df_feature_view:Insulin",
        "predictors_df_feature_view:BMI",
        "predictors_df_feature_view:DiabetesPedigreeFunction",
        "predictors_df_feature_view:Age",
    ]
)

dataset = store.create_saved_dataset(
    from_ = training_data,
    name = "diabetes_dataset",
    storage = SavedDatasetFileStorage('data/diabetes_dataset.parquet')
)
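Conceptually, `get_historical_features` performs a point-in-time join: for each row of the entity dataframe it looks up the most recent feature values recorded at or before that row's `event_timestamp`. The following is a simplified pure-Python sketch of that idea, not Feast's implementation; the function name `point_in_time_join` and the sample rows (including the later Glucose reading) are illustrative, and details such as TTL handling are left out.

```python
from datetime import datetime

def point_in_time_join(entity_rows, feature_rows, feature_names):
    """For each entity row, attach the newest feature values whose
    event_timestamp is not later than the entity row's event_timestamp."""
    joined = []
    for entity in entity_rows:
        candidates = [
            f for f in feature_rows
            if f["patient_id"] == entity["patient_id"]
            and f["event_timestamp"] <= entity["event_timestamp"]
        ]
        latest = max(candidates, key=lambda f: f["event_timestamp"], default=None)
        row = dict(entity)
        for name in feature_names:
            # None marks "no feature value known at that time"
            row[name] = latest[name] if latest else None
        joined.append(row)
    return joined

feature_rows = [
    {"patient_id": 0, "event_timestamp": datetime(2020, 5, 21), "Glucose": 148},
    {"patient_id": 0, "event_timestamp": datetime(2020, 6, 21), "Glucose": 150},  # hypothetical later reading
    {"patient_id": 1, "event_timestamp": datetime(2020, 5, 22), "Glucose": 85},
]
entity_rows = [
    {"patient_id": 0, "event_timestamp": datetime(2020, 6, 1), "Outcome": 1},
    {"patient_id": 1, "event_timestamp": datetime(2020, 5, 1), "Outcome": 0},
]
joined = point_in_time_join(entity_rows, feature_rows, ["Glucose"])
for row in joined:
    print(row["patient_id"], row["Glucose"])
```

In this notebook the entity dataframe and the feature source share the same timestamps, so every target row simply picks up its own predictor row.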



In [36]:
training_data.to_df()

Unnamed: 0,Outcome,event_timestamp,patient_id,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
0,1,2020-05-21 22:06:39.689804+00:00,0,6,148,72,35,0,33.6,0.627,50
1,0,2020-05-22 22:06:39.689804+00:00,1,1,85,66,29,0,26.6,0.351,31
2,1,2020-05-23 22:06:39.689804+00:00,2,8,183,64,0,0,23.3,0.672,32
3,0,2020-05-24 22:06:39.689804+00:00,3,1,89,66,23,94,28.1,0.167,21
4,1,2020-05-25 22:06:39.689804+00:00,4,0,137,40,35,168,43.1,2.288,33
...,...,...,...,...,...,...,...,...,...,...,...
763,0,2022-06-23 22:06:39.689804+00:00,763,10,101,76,48,180,32.9,0.171,63
764,0,2022-06-24 22:06:39.689804+00:00,764,2,122,70,27,0,36.8,0.340,27
765,0,2022-06-25 22:06:39.689804+00:00,765,5,121,72,23,112,26.2,0.245,30
766,1,2022-06-26 22:06:39.689804+00:00,766,1,126,60,0,0,30.1,0.349,47


### 7. Model Training

In [37]:
# Importing dependencies
from feast import FeatureStore
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from joblib import dump

# Getting our FeatureStore
store = FeatureStore(repo_path=".")

# Retrieving the saved dataset and converting it to a DataFrame
training_df = store.get_saved_dataset(name="diabetes_dataset").to_df()

# Separating the features and labels
y = training_df['Outcome']
X = training_df.drop(
    labels=['Outcome', 'event_timestamp', "patient_id"], 
    axis=1)

# Splitting the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    y, 
                                                    stratify=y)

# Creating and training LogisticRegression
reg = LogisticRegression()
reg.fit(X=X_train[sorted(X_train)], y=y_train)

# Saving the model
dump(value=reg, filename="model.joblib")

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


['model.joblib']
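The warning above means the solver hit its iteration limit before converging; it suggests either raising `max_iter` or scaling the data. As a rough illustration of what scaling means, here is a standard-library sketch of z-score standardization, applied to the five Glucose values shown in `data.head()`. The helper name `standardize` is illustrative; in a real pipeline a scikit-learn preprocessing step would normally do this per column.

```python
from statistics import mean, pstdev

def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]

glucose = [148, 85, 183, 89, 137]  # first five Glucose values from data.head()
scaled = standardize(glucose)
print([round(v, 2) for v in scaled])
```

Features on very different scales (for example Insulin versus DiabetesPedigreeFunction) tend to slow the solver down, which is why scaling is one of the suggested remedies.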

### 8. Prepare online feature store
(Loading the features into the online store)

There are two ways to load features into your online store:

- materialize

materialize loads the latest features between two dates:

`feast materialize 2020-01-01T00:00:00 2022-01-01T00:00:00`

- materialize-incremental

materialize-incremental loads the features that arrived since the previous materialization, up to the provided end date:

`feast materialize-incremental 2022-01-01T00:00:00`
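As a mental model, materialization keeps only the newest feature row per entity inside the requested time window and writes it to the online store. The sketch below models that in plain Python; it is a simplification rather than Feast's implementation, the function name `materialize_window` is illustrative, treating both window boundaries as inclusive is an assumption, and the sample dates are shortened to whole days.

```python
from datetime import datetime

def materialize_window(feature_rows, start, end):
    """Return the latest feature row per patient_id with start <= event_timestamp <= end."""
    online = {}
    for row in feature_rows:
        ts = row["event_timestamp"]
        if not (start <= ts <= end):
            continue  # outside the requested window
        current = online.get(row["patient_id"])
        if current is None or ts > current["event_timestamp"]:
            online[row["patient_id"]] = row
    return online

rows = [
    {"patient_id": 0, "event_timestamp": datetime(2020, 5, 21), "Glucose": 148},
    {"patient_id": 766, "event_timestamp": datetime(2022, 6, 26), "Glucose": 126},
    {"patient_id": 767, "event_timestamp": datetime(2022, 6, 27), "Glucose": 93},
]
online = materialize_window(rows, start=datetime(2022, 6, 25), end=datetime(2022, 6, 28))
print(sorted(online))
```

An incremental run can be thought of as the same operation with the start of the window set to the end of the previous run, so only rows inside that newer window are written.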

In [39]:
# Importing dependencies
from feast import FeatureStore
from datetime import datetime, timedelta

# Getting our FeatureStore
store = FeatureStore(repo_path=".")

store.materialize_incremental(end_date = datetime.now())

Materializing 2 feature views to 2022-06-27 22:59:48+05:30 into the sqlite online store.

predictors_df_feature_view from 2022-06-25 17:29:48+05:30 to 2022-06-27 22:59:48+05:30:


100%|████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 216.60it/s]


target_df_feature_view from 2022-06-25 17:29:48+05:30 to 2022-06-28 04:29:48+05:30:


100%|████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 380.88it/s]


### 9. Get online features for prediction

In [40]:
# Importing dependencies
from feast import FeatureStore
import pandas as pd
from joblib import load

# Getting our FeatureStore
store = FeatureStore(repo_path=".")

# Defining our feature names
feast_features = [
    "predictors_df_feature_view:Pregnancies",
    "predictors_df_feature_view:Glucose",
    "predictors_df_feature_view:BloodPressure",
    "predictors_df_feature_view:SkinThickness",
    "predictors_df_feature_view:Insulin",
    "predictors_df_feature_view:BMI",
    "predictors_df_feature_view:DiabetesPedigreeFunction",
    "predictors_df_feature_view:Age",
]

# Getting the latest features
features = store.get_online_features(
    features=feast_features,    
    entity_rows=[{"patient_id": 767}, {"patient_id": 766}]
).to_dict()

# Converting the features to a DataFrame
features_df = pd.DataFrame.from_dict(data=features)
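`to_dict()` returns a column-oriented mapping: each key is an entity or feature name, and each value is a list with one entry per requested entity row. That is why `pd.DataFrame.from_dict` can consume it directly. A small sketch using three of the columns from this example shows how that shape maps to per-patient records; the helper name `columns_to_rows` is illustrative.

```python
def columns_to_rows(columns):
    """Turn {"name": [v0, v1, ...]} into [{"name": v0}, {"name": v1}, ...]."""
    names = list(columns)
    return [dict(zip(names, values)) for values in zip(*columns.values())]

features = {
    "patient_id": [767, 766],
    "Glucose": [93, 126],
    "Age": [23, 47],
}
records = columns_to_rows(features)
print(records)
```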



In [41]:
features_df.head()

Unnamed: 0,patient_id,SkinThickness,Age,Glucose,BloodPressure,BMI,Pregnancies,DiabetesPedigreeFunction,Insulin
0,767,31,23,93,70,30.4,1,0.315,0
1,766,0,47,126,60,30.1,1,0.349,0


### 10. Call the predict function and see the output

In [42]:
# Loading our model and doing inference
reg = load("model.joblib")
predictions = reg.predict(features_df[sorted(features_df.drop("patient_id", axis=1))])
print(predictions)

[0 0]
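Both the training code and the prediction code index the DataFrame with `sorted(...)`. Iterating over a DataFrame yields its column names, so this puts the columns in alphabetical order in both places. That matters because the online store returned the features in a different order than the training set, while the model expects them in the order it was trained on. A plain-Python sketch of the same alignment, using the two column orders shown in the outputs above:

```python
# Column order of the training data returned by get_historical_features
training_columns = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
                    "Insulin", "BMI", "DiabetesPedigreeFunction", "Age"]
# Column order of the online features (patient_id already dropped)
online_columns = ["SkinThickness", "Age", "Glucose", "BloodPressure",
                  "BMI", "Pregnancies", "DiabetesPedigreeFunction", "Insulin"]

# The raw orders differ, but sorting gives one canonical order for both
same_raw_order = training_columns == online_columns
same_sorted_order = sorted(training_columns) == sorted(online_columns)
print(same_raw_order, same_sorted_order)
```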


### References:

https://docs.feast.dev/getting-started/quickstart