**Data Preparation**

In [1]:
import pandas as pd

In [2]:
data = pd.read_csv('diabetes.csv')
data.head()

,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [3]:
predictors_df = data.loc[:,data.columns!='Outcome']

In [4]:
target_df = data['Outcome']

In [5]:
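# Feast sources need an event timestamp column, so generate one synthetic
# daily timestamp per row, ending at the current time
rowLen = data.shape[0]
timestamps = pd.date_range(end=pd.Timestamp.now(), periods=rowLen, freq='D').to_frame(name='event_timestamp', index=False)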

In [6]:
data.shape

(768, 9)

In [7]:
# Sequential ids (0..rowLen-1) act as the entity key for each patient row
idsList = list(range(rowLen))
patient_ids = pd.DataFrame(idsList, columns=['patient_id'])

In [8]:
timestamps

,event_timestamp
0,2021-07-02 14:27:51.457771
1,2021-07-03 14:27:51.457771
2,2021-07-04 14:27:51.457771
3,2021-07-05 14:27:51.457771
4,2021-07-06 14:27:51.457771
...,...
763,2023-08-04 14:27:51.457771
764,2023-08-05 14:27:51.457771
765,2023-08-06 14:27:51.457771
766,2023-08-07 14:27:51.457771


In [9]:
predictors_df = pd.concat(objs=[predictors_df, timestamps, patient_ids], axis=1)
target_df = pd.concat(objs=[target_df, timestamps, patient_ids], axis=1)

In [10]:
predictors_df.to_parquet("predictors_df.parquet")
target_df.to_parquet("target_df.parquet")

**Create feature repo**

In [11]:
!feast init feature_repo

The directory feature_repo contains an existing feature store repository that may cause a conflict



In [12]:
predictors_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 10 columns):
 #   Column                    Non-Null Count  Dtype         
---  ------                    --------------  -----         
 0   Pregnancies               768 non-null    int64         
 1   Glucose                   768 non-null    int64         
 2   BloodPressure             768 non-null    int64         
 3   SkinThickness             768 non-null    int64         
 4   Insulin                   768 non-null    int64         
 5   BMI                       768 non-null    float64       
 6   DiabetesPedigreeFunction  768 non-null    float64       
 7   Age                       768 non-null    int64         
 8   event_timestamp           768 non-null    datetime64[ns]
 9   patient_id                768 non-null    int64         
dtypes: datetime64[ns](1), float64(2), int64(7)
memory usage: 60.1 KB


In [13]:
target_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 3 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   Outcome          768 non-null    int64         
 1   event_timestamp  768 non-null    datetime64[ns]
 2   patient_id       768 non-null    int64         
dtypes: datetime64[ns](1), int64(2)
memory usage: 18.1 KB


In [20]:
cd feature_repo

/home/tinku/Linux_Workspace/feast/feature_repo


In [24]:
pwd

'/home/tinku/Linux_Workspace/feast/feature_repo'

In [26]:
!feast apply

Created entity patient_id
Created feature view predictors_df_feature_view
Created feature view target_df_feature_view

Created sqlite table feature_repo_predictors_df_feature_view
Created sqlite table feature_repo_target_df_feature_view
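
The notebook does not show the definitions that `feast apply` picked up. The following is a minimal sketch of what such a definitions file might look like, assuming a recent Feast release (roughly 0.30 or later, where `Field`, `feast.types`, and `timestamp_field` are available). The entity and feature view names and the column dtypes come from the outputs above; the file paths and the `ttl` value are assumptions, not values taken from the original repo.

```python
# Hypothetical feature_repo definitions file (for example, definitions.py).
# Names match the `feast apply` output; paths and ttl are assumed.
from datetime import timedelta

from feast import Entity, FeatureView, Field, FileSource
from feast.types import Float64, Int64

# Entity keyed on the patient_id column added during data preparation
patient = Entity(name="patient_id", join_keys=["patient_id"])

# Offline sources: the parquet files written earlier (assumed to sit in data/)
predictors_source = FileSource(
    path="data/predictors_df.parquet",
    timestamp_field="event_timestamp",
)
target_source = FileSource(
    path="data/target_df.parquet",
    timestamp_field="event_timestamp",
)

predictors_df_feature_view = FeatureView(
    name="predictors_df_feature_view",
    entities=[patient],
    ttl=timedelta(days=2),  # placeholder; the real value is not shown
    schema=[
        Field(name="Pregnancies", dtype=Int64),
        Field(name="Glucose", dtype=Int64),
        Field(name="BloodPressure", dtype=Int64),
        Field(name="SkinThickness", dtype=Int64),
        Field(name="Insulin", dtype=Int64),
        Field(name="BMI", dtype=Float64),
        Field(name="DiabetesPedigreeFunction", dtype=Float64),
        Field(name="Age", dtype=Int64),
    ],
    source=predictors_source,
)

target_df_feature_view = FeatureView(
    name="target_df_feature_view",
    entities=[patient],
    ttl=timedelta(days=2),  # placeholder; the real value is not shown
    schema=[Field(name="Outcome", dtype=Int64)],
    source=target_source,
)
```

Older Feast releases use a different constructor style (for example, entity names passed as strings), so the exact arguments may need adjusting to the installed version.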



**Generating Training Data Set**

In [27]:
from feast import FeatureStore
from feast.infra.offline_stores.file_source import SavedDatasetFileStorage

In [28]:
fs = FeatureStore(repo_path='.')

In [29]:
# Entity dataframe: the labels plus the entity key and event timestamp
df_y = pd.read_parquet(path = "data/target_df.parquet")
x_list = [
    "predictors_df_feature_view:Pregnancies",
    "predictors_df_feature_view:Glucose",
    "predictors_df_feature_view:BloodPressure",
    "predictors_df_feature_view:SkinThickness",
    "predictors_df_feature_view:Insulin",
    "predictors_df_feature_view:BMI",
    "predictors_df_feature_view:DiabetesPedigreeFunction",
    "predictors_df_feature_view:Age",
]

**Create the training set (x, y) from the feature store, using the target (y) and features (x) defined above**

In [30]:
training_data = fs.get_historical_features(
    entity_df = df_y,
    features = x_list
)
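
Historical retrieval in Feast is a point-in-time join: each row of the entity dataframe gets the feature values that were current as of that row's `event_timestamp`. The sketch below illustrates the idea with plain pandas on made-up toy rows; it is a conceptual analogue using `pd.merge_asof`, not Feast's implementation, and it ignores details such as `ttl`.

```python
import pandas as pd

# Toy feature rows (made-up values): two readings for patient 1, one for patient 2
features = pd.DataFrame({
    "patient_id": [1, 1, 2],
    "event_timestamp": pd.to_datetime(["2023-01-01", "2023-01-10", "2023-01-05"]),
    "Glucose": [100, 120, 90],
})

# Toy entity rows: the label timestamps we want features "as of"
entity_df = pd.DataFrame({
    "patient_id": [1, 2],
    "event_timestamp": pd.to_datetime(["2023-01-07", "2023-01-06"]),
    "Outcome": [1, 0],
})

# For each entity row, take the latest feature row at or before its timestamp.
# merge_asof requires both frames to be sorted by the `on` column.
joined = pd.merge_asof(
    entity_df.sort_values("event_timestamp"),
    features.sort_values("event_timestamp"),
    on="event_timestamp",
    by="patient_id",
    direction="backward",
)
print(joined)
```

Patient 1 receives the reading from 2023-01-01 rather than the later one from 2023-01-10, because the later reading lies after the label's timestamp. In this notebook the entity dataframe and the feature rows share identical timestamps, so every label simply matches its own row.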

In [32]:
dataset = fs.create_saved_dataset(
    from_ = training_data,
    name = "diabetes_dataset",
    storage = SavedDatasetFileStorage("data/diabetes_dataset.parquet")
)



**Simple model training**

In [34]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from joblib import dump

# Getting our FeatureStore
store = FeatureStore(repo_path=".")

# Retrieving the saved dataset and converting it to a DataFrame
training_df = store.get_saved_dataset(name="diabetes_dataset").to_df()

# Separating the features and labels
y = training_df['Outcome']
X = training_df.drop(
    labels=['Outcome', 'event_timestamp', "patient_id"], 
    axis=1)

# Splitting the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    y, 
                                                    stratify=y)

# Creating and training LogisticRegression
reg = LogisticRegression()
reg.fit(X=X_train[sorted(X_train)], y=y_train)

# Saving the model
dump(value=reg, filename="model.joblib")

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


['model.joblib']
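
The convergence warning above suggests either raising `max_iter` or scaling the data. A minimal sketch of both, assuming a scikit-learn pipeline is acceptable here; it runs on synthetic data from `make_classification` as a stand-in for the diabetes features, so the score it prints says nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 768 rows, 8 numeric features, binary label
X, y = make_classification(n_samples=768, n_features=8, random_state=0)

# random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# Standardize the features, then fit; max_iter is raised from the default of 100
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print(f"held-out accuracy: {accuracy:.3f}")
```

Because the scaler lives inside the pipeline, saving the pipeline with `dump` would keep the same scaling at inference time.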

**Online feature store**

In [37]:
from feast import FeatureStore
from datetime import datetime

In [36]:
fs = FeatureStore(repo_path = ".")
fs.materialize_incremental(end_date = datetime.now())

Materializing 2 feature views to 2023-08-08 17:14:55-04:00 into the sqlite online store.

predictors_df_feature_view from 2023-08-06 21:14:55-04:00 to 2023-08-08 17:14:55-04:00:


100%|████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 238.00it/s]


target_df_feature_view from 2023-08-06 21:14:55-04:00 to 2023-08-08 13:14:55-04:00:


100%|████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 264.29it/s]
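
Conceptually, materialization copies the most recent feature row per entity, within the given time window, from the offline store into the online store. The pandas sketch below illustrates that idea on made-up toy rows; it is an illustration of the concept, not how Feast implements it.

```python
import pandas as pd

# Toy offline rows (made-up values): two readings each for two patients
offline = pd.DataFrame({
    "patient_id": [1, 1, 2, 2],
    "event_timestamp": pd.to_datetime(
        ["2023-08-01", "2023-08-07", "2023-08-02", "2023-08-08"]
    ),
    "Glucose": [100, 110, 90, 95],
})

# Hypothetical materialization window
start = pd.Timestamp("2023-08-06")
end = pd.Timestamp("2023-08-09")

# Keep rows inside the window, then the most recent row per entity
in_window = offline[offline["event_timestamp"].between(start, end)]
online = (
    in_window.sort_values("event_timestamp")
    .groupby("patient_id")
    .tail(1)
    .set_index("patient_id")
)
print(online)
```

In the run above, the window spans roughly the last two days, which appears consistent with the `2/2` progress bars: only the two most recent daily rows (patient ids 766 and 767) fall inside it, and those are the ids queried in the inference step.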


**Inference**

In [38]:
from joblib import load

In [45]:
latest_entries = store.get_online_features(
    entity_rows = [{"patient_id": 767}, {"patient_id": 766}],
    features = x_list
).to_dict()
latest_df = pd.DataFrame.from_dict(data=latest_entries)

In [46]:
reg = load("model.joblib")
predictions = reg.predict(latest_df[sorted(latest_df.drop("patient_id", axis=1))])

In [47]:
predictions

array([0, 0])
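
Both the training cell and the inference cell index their frames with `sorted(...)`, presumably so that the model sees the feature columns in the same order each time, regardless of the order in which the feature store returns them. A small sketch of why that works, using values from the first two rows of the dataset shown earlier:

```python
import pandas as pd

# Training-side frame and online-side frame with columns in different orders
train_df = pd.DataFrame({"Glucose": [148], "Age": [50], "BMI": [33.6]})
online_df = pd.DataFrame(
    {"BMI": [26.6], "patient_id": [1], "Glucose": [85], "Age": [31]}
)

# Iterating a DataFrame yields its column names, so sorted(df) is a sorted
# list of column names
train_order = sorted(train_df)
infer_order = sorted(online_df.drop("patient_id", axis=1))

print(train_order)   # ['Age', 'BMI', 'Glucose']
print(infer_order)   # ['Age', 'BMI', 'Glucose']

# Indexing with the sorted names puts both frames in the same column order
aligned = online_df[infer_order]
```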