# <span style="font-width:bold; font-size: 3rem; color:#1EB182;"><img src="images/icon102.png" width="38px"></img> **Hopsworks Feature Store** </span><span style="font-width:bold; font-size: 3rem; color:#333;">- Part 03: Model training & UI Exploration</span>

<span style="font-width:bold; font-size: 1.4rem;">In this last notebook, we will train a model on the dataset we created in the previous tutorial. We will train our model using standard Python and Scikit-learn, although it could just as well be trained with other machine learning frameworks such as PySpark, TensorFlow, and PyTorch. We will also show some of the exploration that can be done in Hopsworks, notably the search functions and the lineage. </span>

## **🗒️ This notebook is divided into 3 main sections:**
1. **Load the training data**
2. **Train the model**
3. **Explore feature groups and views** via the UI.

![tutorial-flow](images/03_model.png)

In [1]:
import hsfs

conn = hsfs.connection()
fs = conn.get_feature_store()

Connected. Call `.close()` to terminate connection gracefully.


## <span style="color:#ff5f27;"> ✨ Load Training Data </span>

First, we'll need to fetch the training dataset that we created in the previous notebook. We will use the January - February data for training and validation.

In [2]:
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Load data.
feature_view = fs.get_feature_view("transactions_view", 1)
_, df = feature_view.get_training_dataset_splits({'train': 80, 'validation': 20}, version = 1)

X_train = df["train"]
X_val = df['validation']



We will train a model to predict `fraud_label` given the rest of the features.

In [3]:
# Separate target feature from input features.
target = feature_view.label[0]
y_train = X_train.pop(target)
y_val = X_val.pop(target)

Let's check the distribution of our target label.

In [4]:
y_train.value_counts(normalize=True)

0    0.998337
1    0.001663
Name: fraud_label, dtype: float64

Notice that the distribution is extremely skewed, which is natural considering that fraudulent transactions make up a tiny part of all transactions. Thus we should somehow address the class imbalance. There are many approaches for this, such as weighting the loss function, over- or undersampling, creating synthetic data, or modifying the decision threshold. In this example, we'll use the simplest method which is to just supply a class weight parameter to our learning algorithm. The class weight will affect how much importance is attached to each class, which in our case means that higher importance will be placed on positive (fraudulent) samples.
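
As a rough illustration of the class-weight idea, the sketch below computes weights that are inversely proportional to class frequency, using the formula `n_samples / (n_classes * class_count)`, which is the heuristic scikit-learn documents for `class_weight='balanced'`. The label list is made up for illustration (it only roughly mimics the skew above), and `balanced_class_weights` is a hypothetical helper, not part of Hopsworks or scikit-learn.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {class: weight} with weights inversely proportional to class frequency."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical labels: 998 legitimate (0) and 2 fraudulent (1) transactions.
example_labels = [0] * 998 + [1] * 2
weights = balanced_class_weights(example_labels)
print(weights)
```

A dictionary of this shape could be passed as `class_weight` to the learning algorithm in place of a hand-picked value.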

## <span style="color:#ff5f27;"> 🏃 Train Model</span>

Next we'll train a model. Here, we set the class weight of the positive class to 0.9 and that of the negative class to 0.1, so the positive class is weighted nine times as heavily as the negative class.

In [5]:
# Train model.
pos_class_weight = 0.9
clf = LogisticRegression(class_weight={0: 1.0 - pos_class_weight, 1: pos_class_weight}, solver='liblinear')
clf.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight={0: 0.09999999999999998, 1: 0.9},
                   dual=False, fit_intercept=True, intercept_scaling=1,
                   l1_ratio=None, max_iter=100, multi_class='auto', n_jobs=None,
                   penalty='l2', random_state=None, solver='liblinear',
                   tol=0.0001, verbose=0, warm_start=False)

Let's see how well it performs on our validation data.

In [6]:
from sklearn.metrics import classification_report

preds = clf.predict(X_val)

print(classification_report(y_val, preds))


              precision    recall  f1-score   support

           0       1.00      1.00      1.00     18393
           1       0.00      0.00      0.00        31

    accuracy                           1.00     18424
   macro avg       0.50      0.50      0.50     18424
weighted avg       1.00      1.00      1.00     18424
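
The headline accuracy in this report can be misleading. Here is a minimal sketch of the arithmetic, using the support counts from the report and assuming (consistent with the zero precision and recall for class 1) that every validation row was predicted as class 0:

```python
# Support counts taken from the classification report above.
n_legit = 18393  # class 0 rows in the validation set
n_fraud = 31     # class 1 rows in the validation set

# Assumption: the model predicts class 0 for every row.
true_negatives = n_legit
false_negatives = n_fraud
true_positives = 0

accuracy = (true_negatives + true_positives) / (n_legit + n_fraud)
fraud_recall = true_positives / (true_positives + false_negatives)

print(f"accuracy:     {accuracy:.4f}")
print(f"fraud recall: {fraud_recall:.4f}")
```

Under that assumption the accuracy still rounds to 1.00 although none of the 31 fraudulent transactions are caught, which is why the per-class recall is the more informative number here.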





## <span style="color:#ff5f27;">  Use the model to score transactions </span>
We trained the model on January - February data. Now let's retrieve March data and score each transaction as fraudulent or not.


In [7]:
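from datetime import datetime

date_format = "%Y-%m-%d %H:%M:%S"

# Define the event-time window for batch scoring, in milliseconds since the epoch.
# Note: the start below is 3 January 2022; use "2022-03-01 00:00:01" to restrict the window to March.
start_time = int(datetime.strptime("2022-01-03 00:00:01", date_format).timestamp() * 1000)
end_time = int(datetime.strptime("2022-03-31 23:59:59", date_format).timestamp() * 1000)

# Initialise batch scoring with training dataset version 1, then fetch the feature data for the window.
feature_view.init_batch_scoring(1)
march_transactions = feature_view.get_batch_data(start_time=start_time, end_time=end_time)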

2022-05-31 21:10:33,992 INFO: USE `fraud_simplified_featurestore`
2022-05-31 21:10:34,773 INFO: WITH right_fg0 AS (SELECT *
FROM (SELECT `fg1`.`category` `category`, `fg1`.`amount` `amount`, `fg1`.`age_at_transaction` `age_at_transaction`, `fg1`.`days_until_card_expires` `days_until_card_expires`, `fg1`.`loc_delta` `loc_delta`, `fg1`.`cc_num` `join_pk_cc_num`, `fg1`.`datetime` `join_evt_datetime`, `fg0`.`trans_volume_mstd` `trans_volume_mstd`, `fg0`.`trans_volume_mavg` `trans_volume_mavg`, `fg0`.`trans_freq` `trans_freq`, `fg0`.`loc_delta_mavg` `loc_delta_mavg`, RANK() OVER (PARTITION BY `fg1`.`cc_num`, `fg1`.`datetime` ORDER BY `fg0`.`datetime` DESC) pit_rank_hopsworks
FROM `fraud_simplified_featurestore`.`transactions_1` `fg1`
INNER JOIN `fraud_simplified_featurestore`.`transactions_4h_aggs_1` `fg0` ON `fg1`.`cc_num` = `fg0`.`cc_num` AND `fg1`.`datetime` >= `fg0`.`datetime`
WHERE `fg1`.`datetime` >= 1641168001000 AND `fg1`.`datetime` <= 1648771199000) NA
WHERE `pit_rank_hopsworks` = 
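
The millisecond timestamps that appear in the logged query can be reproduced with a small helper. This is a sketch only: `to_epoch_ms` is a hypothetical name, and it assumes the date strings are meant as UTC, which matches the values in the log above. Calling `datetime.timestamp()` on a naive datetime uses the local time zone of the machine, so the result can differ between environments unless the zone is made explicit.

```python
from datetime import datetime, timezone

DATE_FORMAT = "%Y-%m-%d %H:%M:%S"


def to_epoch_ms(date_string, date_format=DATE_FORMAT):
    """Parse a date string as UTC and return milliseconds since the Unix epoch."""
    parsed = datetime.strptime(date_string, date_format).replace(tzinfo=timezone.utc)
    return int(parsed.timestamp() * 1000)


start_ms = to_epoch_ms("2022-01-03 00:00:01")
end_ms = to_epoch_ms("2022-03-31 23:59:59")
print(start_ms, end_ms)
```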

In [8]:
march_transactions

| | category | amount | age_at_transaction | days_until_card_expires | loc_delta | trans_volume_mstd | trans_volume_mavg | trans_freq | loc_delta_mavg |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 4 | 0.003120 | 0.091487 | 0.117163 | 0.000000 | 0.003120 | 0.003120 | 0.003120 | 0.000000 |
| 1 | 2 | 0.002173 | 0.091506 | 0.116883 | 0.122200 | 0.002173 | 0.002173 | 0.002173 | 0.132292 |
| 2 | 4 | 0.000008 | 0.091513 | 0.116773 | 0.120125 | 0.000008 | 0.000008 | 0.000008 | 0.130046 |
| 3 | 4 | 0.000047 | 0.091518 | 0.116696 | 0.000000 | 0.000028 | 0.000028 | 0.000028 | 0.065023 |
| 4 | 4 | 0.000659 | 0.091615 | 0.115229 | 0.040270 | 0.000659 | 0.000659 | 0.000659 | 0.043596 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 102971 | 0 | 0.000736 | 0.357286 | 0.467677 | 0.228904 | 0.000736 | 0.000736 | 0.000736 | 0.247809 |
| 102972 | 0 | 0.002816 | 0.357321 | 0.467147 | 0.166719 | 0.002816 | 0.002816 | 0.002816 | 0.180488 |
| 102973 | 0 | 0.002934 | 0.357325 | 0.467088 | 0.166874 | 0.002875 | 0.002875 | 0.002875 | 0.180572 |
| 102974 | 0 | 0.010322 | 0.357392 | 0.466076 | 0.001149 | 0.010322 | 0.010322 | 0.010322 | 0.001244 |


In [9]:
predictions = clf.predict(march_transactions)

In [10]:
predictions

array([0, 0, 0, ..., 0, 0, 0])
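
For a binary logistic regression, `clf.predict` effectively applies a 0.5 cut-off to the predicted probability of the positive class. One of the alternatives mentioned earlier is to move that decision threshold. The sketch below shows the idea on made-up probabilities that stand in for `clf.predict_proba(march_transactions)[:, 1]`; the values, the 0.2 threshold, and the helper name `apply_threshold` are illustrative assumptions, not results from this notebook.

```python
def apply_threshold(fraud_probabilities, threshold=0.5):
    """Label a transaction as fraud (1) when its probability reaches the threshold."""
    return [1 if p >= threshold else 0 for p in fraud_probabilities]


# Made-up probabilities of the positive (fraud) class.
example_probabilities = [0.01, 0.03, 0.27, 0.04, 0.62]

default_labels = apply_threshold(example_probabilities)       # cut-off 0.5
lenient_labels = apply_threshold(example_probabilities, 0.2)  # cut-off 0.2

print(default_labels)
print(lenient_labels)
```

Lowering the threshold flags more transactions, trading precision for recall; a suitable value would normally be chosen on the validation split rather than on the data being scored.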

## <span style="color:#ff5f27;"> 👓  Exploration</span>
In the Hopsworks feature store, the metadata allows for multiple levels of exploration and review. Here we will show a few of those capabilities.

### 🔎 <b>Search</b> 
Using the search function in the UI, you can query any aspect of the feature groups and training data that were previously created. In the GIF below, we show how the tag we added in the first section can be searched to get all the feature groups with the `PII` tag value.

### 📊 <b>Statistics</b> 
We can also enable statistics for one or all of the feature groups. Here the command is commented out so that the notebook doesn't run for too long.

In [None]:
#trans_fg = fs.get_feature_group()
#trans_fg.update_statistics_config(
#       enabled=True, 
#       correlations=False, 
#       histograms=False, 
#)


### ⛓️ <b> Lineage </b> 
For every feature group and feature view, you can inspect the relations between the abstractions: which feature group produced which training dataset, and which model uses that dataset.
This gives a clear understanding of the pipeline and of how each element fits into it.

## <span style="color:#ff5f27;"> 🎁  Wrapping things up </span>

We have now trained a simple model on training data that we created in the feature store. This concludes the first module and the introduction to the core aspects of the feature store. In the second module, we will introduce streaming and external feature groups for a similar fraud use case.