# <span style="font-width:bold; font-size: 3rem; color:#1EB182;"><img src="images/icon102.png" width="38px"></img> **Hopsworks Feature Store** </span><span style="font-width:bold; font-size: 3rem; color:#333;">- Part 02: Training Data & Feature views</span>

<span style="font-width:bold; font-size: 1.4rem;">This is the second part of the quick start series of tutorials about the Hopsworks Feature Store. This notebook explains how to read from a feature group and create a training dataset within the feature store.</span>

## **🗒️ In this notebook we will see how to create a training dataset from the feature groups:** 
1. **Select the features** we want to train our model on,
2. **Define how the features** should be preprocessed,
3. **Create a dataset split** for training and validation data.

![tutorial-flow](images/02_training-dataset.png) 

In [1]:
import hsfs

conn = hsfs.connection()
fs = conn.get_feature_store()

Connected. Call `.close()` to terminate connection gracefully.


## <span style="color:#ff5f27;"> 🔪 Feature Selection </span>

We start by selecting all the features we want to include for model training/inference.

In [2]:
# Load feature groups.
trans_fg = fs.get_feature_group('transactions', version=1)
window_aggs_fg = fs.get_feature_group('transactions_4h_aggs', version=1)

# Select features for training data.
ds_query = trans_fg.select(["fraud_label", "category", "amount", "age_at_transaction", "days_until_card_expires", "loc_delta"])\
    .join(window_aggs_fg.select_except(["cc_num"]), on="cc_num")

ds_query.show(5)

2022-05-31 16:16:34,323 INFO: USE `fraud_simplified_featurestore`
2022-05-31 16:16:35,139 INFO: WITH right_fg0 AS (SELECT *
FROM (SELECT `fg1`.`fraud_label` `fraud_label`, `fg1`.`category` `category`, `fg1`.`amount` `amount`, `fg1`.`age_at_transaction` `age_at_transaction`, `fg1`.`days_until_card_expires` `days_until_card_expires`, `fg1`.`loc_delta` `loc_delta`, `fg1`.`cc_num` `join_pk_cc_num`, `fg1`.`datetime` `join_evt_datetime`, `fg0`.`trans_volume_mstd` `trans_volume_mstd`, `fg0`.`trans_volume_mavg` `trans_volume_mavg`, `fg0`.`trans_freq` `trans_freq`, `fg0`.`loc_delta_mavg` `loc_delta_mavg`, RANK() OVER (PARTITION BY `fg1`.`cc_num`, `fg1`.`datetime` ORDER BY `fg0`.`datetime` DESC) pit_rank_hopsworks
FROM `fraud_simplified_featurestore`.`transactions_1` `fg1`
INNER JOIN `fraud_simplified_featurestore`.`transactions_4h_aggs_1` `fg0` ON `fg1`.`cc_num` = `fg0`.`cc_num` AND `fg1`.`datetime` >= `fg0`.`datetime`) NA
WHERE `pit_rank_hopsworks` = 1) (SELECT `right_fg0`.`fraud_label` `fraud

Unnamed: 0,fraud_label,category,amount,age_at_transaction,days_until_card_expires,loc_delta,trans_volume_mstd,trans_volume_mavg,trans_freq,loc_delta_mavg
0,0,Grocery,93.51,25.334094,175.91228,0.0,93.51,93.51,93.51,0.0
1,0,Domestic Transport,65.14,25.335632,175.350486,0.319574,65.14,65.14,65.14,0.319574
2,0,Grocery,0.26,25.336235,175.130347,0.314148,0.26,0.26,0.26,0.314148
3,0,Grocery,1.43,25.33666,174.975058,0.0,0.845,0.845,0.845,0.157074
4,0,Grocery,19.75,25.34471,172.034664,0.105313,19.75,19.75,19.75,0.105313


Recall that we computed the features in `transactions_4h_aggs` using 4-hour aggregates. If we had created multiple feature groups with identical schema for different window lengths, and wanted to include them in the join we would need to include a prefix argument in the join to avoid feature name clash. See the [documentation](https://docs.hopsworks.ai/feature-store-api/latest/generated/api/query_api/#join) for more details.
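
The name clash mentioned above can be illustrated without a feature store connection. The following is a minimal sketch in plain Python, not the hsfs API: the rows, the 12-hour aggregate, and the `agg_4h_` / `agg_12h_` prefixes are hypothetical and only show why a prefix is needed when two sources share the same feature names.

```python
def add_prefix(row, prefix, key="cc_num"):
    """Return a copy of `row` with every column except the join key prefixed."""
    return {(name if name == key else prefix + name): value for name, value in row.items()}


def join_rows(left, right, key="cc_num"):
    """Merge two rows that share a key; refuse to silently overwrite columns."""
    if left[key] != right[key]:
        raise ValueError("rows do not share the same key")
    clash = (set(left) & set(right)) - {key}
    if clash:
        raise ValueError(f"feature name clash: {sorted(clash)}")
    return {**left, **right}


# Hypothetical rows: two aggregate sources with an identical schema.
transaction = {"cc_num": 1, "amount": 93.51}
aggs_4h = {"cc_num": 1, "trans_freq": 3.0}
aggs_12h = {"cc_num": 1, "trans_freq": 7.0}

# Without prefixes the second join collides on `trans_freq`.
try:
    join_rows(join_rows(transaction, aggs_4h), aggs_12h)
    clash_detected = False
except ValueError:
    clash_detected = True

# With a prefix per source, both aggregates can coexist in one row.
joined = join_rows(
    join_rows(transaction, add_prefix(aggs_4h, "agg_4h_")),
    add_prefix(aggs_12h, "agg_12h_"),
)
print(clash_detected)
print(joined)
```

In the feature store itself the same effect is obtained through the prefix argument of the join, as described in the linked documentation.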

## <span style="color:#ff5f27;"> 🤖 Transformation Functions </span>

We will preprocess our data using *min-max scaling* on numerical features and *label encoding* on categorical features. To do this, we simply define a mapping between our features and transformation functions. Attaching the transformations in this way means that functions such as *min-max scaling* are fitted only on the training data (and not on the validation/test data), which prevents data leakage.
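
To make the leakage argument concrete, here is a minimal sketch in plain Python of fitting a min-max scaler on the training split only and then reusing those statistics on validation data. The helper names and sample values are hypothetical, and the built-in Hopsworks `min_max_scaler` and `label_encoder` may differ in detail (for example in how labels are ordered).

```python
def fit_min_max(values):
    """Learn the scaling statistics (min, max) from the training split only."""
    return min(values), max(values)


def min_max_scale(value, stats):
    """Scale a value with previously fitted statistics."""
    lo, hi = stats
    if hi == lo:
        return 0.0  # constant feature: avoid division by zero
    return (value - lo) / (hi - lo)


def fit_label_encoder(categories):
    """Map each distinct category to an integer (sorted order is an assumption)."""
    return {category: index for index, category in enumerate(sorted(set(categories)))}


# Hypothetical sample values for the `amount` feature.
train_amounts = [0.26, 1.43, 19.75, 93.51]
validation_amounts = [65.14]

# Statistics come from the training split and are reused unchanged elsewhere.
stats = fit_min_max(train_amounts)
scaled_train = [min_max_scale(v, stats) for v in train_amounts]
scaled_validation = [min_max_scale(v, stats) for v in validation_amounts]

encoder = fit_label_encoder(["Grocery", "Domestic Transport", "Grocery"])
print(scaled_train, scaled_validation, encoder)
```

Because the validation values never influence `stats`, information from the validation split cannot leak into the features the model is trained on.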

In [3]:
# Load transformation functions.
min_max_scaler = fs.get_transformation_function(name="min_max_scaler")
label_encoder = fs.get_transformation_function(name="label_encoder")

# Map features to transformations.
transformation_functions = {
    "category": label_encoder,
    "amount": min_max_scaler,
    "trans_volume_mavg": min_max_scaler,
    "trans_volume_mstd": min_max_scaler,
    "trans_freq": min_max_scaler,
    "loc_delta": min_max_scaler,
    "loc_delta_mavg": min_max_scaler,
    "age_at_transaction": min_max_scaler,
    "days_until_card_expires": min_max_scaler,
}

## <span style="color:#ff5f27;"> ⚙️ Feature View Creation </span>

A Feature View lets us define a schema in the form of a query with filters, a model target feature/label, and additional transformation functions.
To create a Feature View we can use `fs.create_feature_view()`.

In [4]:
feature_view = fs.create_feature_view(
    name='transactions_view',
    query=ds_query,
    label=["fraud_label"],
    transformation_functions=transformation_functions
)

To view and explore the data in the feature view, we can retrieve batch data using the `get_batch_data()` method.

In [5]:
feature_view.get_batch_data().head(5)

2022-05-31 16:17:29,639 INFO: USE `fraud_simplified_featurestore`
2022-05-31 16:17:30,510 INFO: WITH right_fg0 AS (SELECT *
FROM (SELECT `fg1`.`category` `category`, `fg1`.`amount` `amount`, `fg1`.`age_at_transaction` `age_at_transaction`, `fg1`.`days_until_card_expires` `days_until_card_expires`, `fg1`.`loc_delta` `loc_delta`, `fg1`.`cc_num` `join_pk_cc_num`, `fg1`.`datetime` `join_evt_datetime`, `fg0`.`trans_volume_mstd` `trans_volume_mstd`, `fg0`.`trans_volume_mavg` `trans_volume_mavg`, `fg0`.`trans_freq` `trans_freq`, `fg0`.`loc_delta_mavg` `loc_delta_mavg`, RANK() OVER (PARTITION BY `fg1`.`cc_num`, `fg1`.`datetime` ORDER BY `fg0`.`datetime` DESC) pit_rank_hopsworks
FROM `fraud_simplified_featurestore`.`transactions_1` `fg1`
INNER JOIN `fraud_simplified_featurestore`.`transactions_4h_aggs_1` `fg0` ON `fg1`.`cc_num` = `fg0`.`cc_num` AND `fg1`.`datetime` >= `fg0`.`datetime`) NA
WHERE `pit_rank_hopsworks` = 1) (SELECT `right_fg0`.`category` `category`, `right_fg0`.`amount` `amount`, `

Unnamed: 0,category,amount,age_at_transaction,days_until_card_expires,loc_delta,trans_volume_mstd,trans_volume_mavg,trans_freq,loc_delta_mavg
0,Grocery,93.51,25.334094,175.91228,0.0,93.51,93.51,93.51,0.0
1,Domestic Transport,65.14,25.335632,175.350486,0.319574,65.14,65.14,65.14,0.319574
2,Grocery,0.26,25.336235,175.130347,0.314148,0.26,0.26,0.26,0.314148
3,Grocery,1.43,25.33666,174.975058,0.0,0.845,0.845,0.845,0.157074
4,Grocery,19.75,25.34471,172.034664,0.105313,19.75,19.75,19.75,0.105313


## <span style="color:#ff5f27;"> 🏋️ Training Dataset Creation</span>

In Hopsworks, training data is a query where the projection (set of features) is determined by the parent Feature View, with an optional snapshot on disk of the data returned by the query.

**A training dataset may contain splits such as:**
* Training set - the subset of training data used to train a model.
* Validation set - the subset of training data used to evaluate hyperparameters when training a model.
* Test set - the holdout subset of training data used to evaluate a model.
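
As a rough illustration of how weights such as `{'train': 80, 'validation': 20}` translate into splits, the following plain-Python sketch shuffles the rows with a fixed seed and cuts them proportionally. The function name and the seed are hypothetical, and this is not how Hopsworks performs the split internally; it only shows the idea.

```python
import random


def random_split(rows, weights, seed=42):
    """Shuffle rows and cut them into named splits proportional to `weights`."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)

    total = sum(weights.values())
    names = list(weights)
    splits = {}
    start = 0
    cumulative = 0
    for position, name in enumerate(names):
        cumulative += weights[name]
        # The last split takes the remainder so that no row is dropped.
        if position == len(names) - 1:
            end = len(shuffled)
        else:
            end = round(len(shuffled) * cumulative / total)
        splits[name] = shuffled[start:end]
        start = end
    return splits


splits = random_split(range(100), {"train": 80, "validation": 20})
print(len(splits["train"]), len(splits["validation"]))
```

Every row ends up in exactly one split, so the validation set stays disjoint from the data the model is trained on.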

A training dataset is created using the `feature_view.create_training_dataset()` method.

**From the feature view APIs we can also create training datasets based on event time filters by specifying `start_time` and `end_time`.**
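
The event time filters in this notebook are passed as integers in milliseconds since the Unix epoch. The sketch below shows that conversion in isolation. The helper name `to_epoch_millis` is hypothetical, and treating the timestamps as UTC is an assumption: a naive `datetime` is otherwise interpreted in the local time zone of the machine running the notebook.

```python
from datetime import datetime, timezone

DATE_FORMAT = "%Y-%m-%d %H:%M:%S"


def to_epoch_millis(text, tz=timezone.utc):
    """Parse a timestamp string and return milliseconds since the Unix epoch."""
    parsed = datetime.strptime(text, DATE_FORMAT).replace(tzinfo=tz)
    return int(parsed.timestamp() * 1000)


start_ms = to_epoch_millis("2022-01-01 00:00:01")
end_ms = to_epoch_millis("2022-02-28 23:59:59")
print(start_ms, end_ms)
```

Pinning the time zone explicitly keeps the filter boundaries identical regardless of where the code runs.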



In [6]:
from datetime import datetime
date_format = "%Y-%m-%d %H:%M:%S"

In [7]:
# Create a training dataset based on an event time filter.
# Timestamps are expressed in milliseconds since the Unix epoch.
start_time = int(datetime.strptime("2022-01-01 00:00:01", date_format).timestamp() * 1000)
end_time = int(datetime.strptime("2022-02-28 23:59:59", date_format).timestamp() * 1000)

td_version, td_job = feature_view.create_training_dataset(
    description = 'transactions_dataset_jan_feb',
    data_format = 'csv',
    splits = {'train': 80, 'validation': 20},
    train_split = "train",    
    write_options = {'wait_for_job': True},
    coalesce = True,
    start_time = start_time,
    end_time = end_time,
)

Training dataset job started successfully, you can follow the progress at https://hopsworks.glassfish.service.consul:8182/p/130/jobs/named/transactions_view_2_1_create_fv_td_31052022161820/executions




## <span style="color:#ff5f27;"> 🪝 Training Dataset Retrieval </span>

To retrieve training data from storage (already materialised) or directly from feature groups, we can use the `get_training_dataset_splits` or `get_training_dataset` methods. If no version is provided, or the provided version does not exist yet, a new version of the training data is created according to the given arguments and returned as a dataframe. If the provided version already exists, the training data is read from storage or from the feature groups and returned as a dataframe. If a split is provided, only that split is read.

In [9]:
_, df = feature_view.get_training_dataset_splits({'train': 80, 'validation': 20}, version = td_version)



In [11]:
df['train']

Unnamed: 0,fraud_label,category,amount,age_at_transaction,days_until_card_expires,loc_delta,trans_volume_mstd,trans_volume_mavg,trans_freq,loc_delta_mavg
0,0,0,0.000000e+00,0.047263,0.942244,0.035718,0.000000e+00,0.000000e+00,0.000000e+00,0.038668
1,0,0,0.000000e+00,0.063646,0.109239,0.000044,0.000000e+00,0.000000e+00,0.000000e+00,0.000047
2,0,0,0.000000e+00,0.340523,0.187686,0.211902,0.000000e+00,0.000000e+00,0.000000e+00,0.229403
3,0,0,0.000000e+00,0.954656,0.871548,0.183911,0.000000e+00,0.000000e+00,0.000000e+00,0.199101
4,0,0,3.336858e-07,0.363999,0.655571,0.093095,3.336858e-07,3.336858e-07,3.336858e-07,0.100784
...,...,...,...,...,...,...,...,...,...,...
74554,1,5,7.304049e-03,0.909320,0.721874,0.214688,4.702467e-03,4.702467e-03,4.702467e-03,0.205915
74555,1,5,1.362873e-02,0.516229,0.296445,0.086154,5.831589e-03,5.831589e-03,5.831589e-03,0.100012
74556,1,8,1.698461e-04,0.132060,0.811612,0.217638,1.698461e-04,1.698461e-04,1.698461e-04,0.235613
74557,1,8,4.875150e-04,0.488922,0.596687,0.043470,4.875150e-04,4.875150e-04,4.875150e-04,0.047060


In [12]:
df['validation']

Unnamed: 0,fraud_label,category,amount,age_at_transaction,days_until_card_expires,loc_delta,trans_volume_mstd,trans_volume_mavg,trans_freq,loc_delta_mavg
0,0,0,0.000000e+00,0.010738,0.846526,0.024955,0.000000e+00,0.000000e+00,0.000000e+00,0.027016
1,0,0,3.336858e-07,0.374494,0.444673,0.246222,3.336858e-07,3.336858e-07,3.336858e-07,0.266557
2,0,0,3.336858e-07,0.850157,0.946648,0.134638,3.336858e-07,3.336858e-07,3.336858e-07,0.145757
3,0,0,6.673716e-07,0.030149,0.423169,0.207004,6.673716e-07,6.673716e-07,6.673716e-07,0.224101
4,0,0,6.673716e-07,0.031708,0.683413,0.239105,7.633063e-04,7.633063e-04,7.633063e-04,0.176698
...,...,...,...,...,...,...,...,...,...,...
18419,1,4,2.690842e-03,0.206518,0.208694,0.027481,2.856851e-03,2.856851e-03,2.856851e-03,0.032210
18420,1,4,2.730885e-03,0.909320,0.721873,0.132608,2.889330e-03,2.889330e-03,2.889330e-03,0.175565
18421,1,4,3.122632e-03,0.448481,0.634433,0.165436,1.399955e-03,1.399955e-03,1.399955e-03,0.135249
18422,1,4,3.267118e-03,0.909320,0.721872,0.198211,2.489482e-03,2.489482e-03,2.489482e-03,0.168709


## <span style="color:#ff5f27;">⏭️ **Next:** Part 03 </span>

In the following notebook, we will train a model on the dataset we created in this notebook and take a quick look at the lineage.