---
title: "2. Create Training Data from Features"
date: 2021-02-24
type: technical_note
draft: false
---

## 🧑🏻‍🏫 HSFS `Feature Views` and `Training Datasets`

`Feature Views` are the third building block of the Hopsworks Feature Store. A Feature View stores the metadata of our dataset.

`Training Datasets` are the fourth building block of the Hopsworks Feature Store.

Training datasets can be saved in an ML-framework-friendly format (e.g. TFRecord, CSV, NumPy) and then fed to a machine learning model for training.

Training datasets can also be stored on external storage systems like Amazon S3 or GCS to be read by external model training platforms.

As with the previous notebooks, the first step is to establish a connection with the Hopsworks feature store and get the feature store handle.

In [1]:
import hsfs

# Create a connection
connection = hsfs.connection()

# Get the feature store handle for the project's feature store
fs = connection.get_feature_store()

Starting Spark application


ID,Application ID,Kind,State,Spark UI,Driver log
30,application_1653473648291_0126,pyspark,idle,Link,Link


SparkSession available as 'spark'.
Connected. Call `.close()` to terminate connection gracefully.

---

## 🔮 Create a `Feature View` from a query

In the previous notebook ([feature_exploration](./feature_exploration.ipynb)) we walked through how to explore and query the Hopsworks feature store using HSFS. We can use the queries produced in that notebook to create a `Feature View`.

In [2]:
sales_fg = fs.get_feature_group(
    name = 'sales_fg',
    version = 1
)

exogenous_fg = fs.get_feature_group(
    name = 'exogenous_fg',
    version = 2
)

query = sales_fg.select_all()\
        .join(exogenous_fg.select(['fuel_price', 'unemployment', 'cpi']))

To create a Feature View, we can use the `FeatureStore.create_feature_view()` method.

In [3]:
feature_view = fs.create_feature_view(
    name = 'exodenous_sale',
    version = 1,
    query = query
)

In [4]:
feature_view

<hsfs.feature_view.FeatureView object at 0x7fa0d5e6dbe0>

The `Feature View` is now saved in Hopsworks, and we can retrieve it using `FeatureStore.get_feature_view()`.

In [5]:
feature_view = fs.get_feature_view(
    name = 'exodenous_sale',
    version = 1
)

In [6]:
feature_view.version

1

> `FeatureView.preview_feature_vector()` returns a sample of an assembled serving vector from the online feature store.

In [7]:
feature_view.preview_feature_vector()

[]

> To get a subset of the data, use `FeatureView.get_batch_data()`.

In [8]:
df_batch = feature_view.get_batch_data()

In [9]:
type(df_batch)

<class 'pyspark.sql.dataframe.DataFrame'>

In [10]:
df_batch.select(['fuel_price', 'unemployment', 'cpi']).show(5)

+----------+------------+-----------+
|fuel_price|unemployment|        cpi|
+----------+------------+-----------+
|     2.625|       8.106|211.3501429|
|     2.625|       8.106|211.3501429|
|     2.625|       8.106|211.3501429|
|     2.625|       8.106|211.3501429|
|     2.625|       8.106|211.3501429|
+----------+------------+-----------+
only showing top 5 rows
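
The identical values in the five rows above are plausibly a consequence of the join rather than a data problem: if `sales_fg` holds one row per store, department and date while `exogenous_fg` holds one row per store and date, every department row of a store picks up the same exogenous values. The plain-Python sketch below illustrates this with toy rows. The join keys (`store`, `date`), the helper `inner_join` and the pairing of the exogenous values with this date are assumptions for illustration, not taken from the feature groups' actual schemas.

```python
# Toy illustration of a key-based join; keys and exogenous values are assumed.
sales_rows = [
    {"store": 1, "dept": 1, "date": "2010-03-05", "weekly_sales": 21827.9},
    {"store": 1, "dept": 10, "date": "2010-03-05", "weekly_sales": 33299.27},
    {"store": 1, "dept": 11, "date": "2010-03-05", "weekly_sales": 19082.9},
]
exogenous_rows = [
    {"store": 1, "date": "2010-03-05", "fuel_price": 2.625, "unemployment": 8.106},
]


def inner_join(left, right, keys):
    """Inner-join two lists of dicts on the given key columns."""
    # Index the right-hand rows by their key tuple for fast lookup.
    index = {}
    for row in right:
        index.setdefault(tuple(row[k] for k in keys), []).append(row)
    joined = []
    for row in left:
        for match in index.get(tuple(row[k] for k in keys), []):
            joined.append({**row, **match})
    return joined


joined_rows = inner_join(sales_rows, exogenous_rows, keys=["store", "date"])
for row in joined_rows:
    print(row["dept"], row["fuel_price"], row["unemployment"])
```

Each of the three department rows ends up with the same `fuel_price` and `unemployment`, which matches the pattern seen in the batch output.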

---

## 🧑🏻‍🔬 Training Dataset Creation

To create a training dataset, we use the `FeatureView.create_training_dataset()` method.

⚠️ Some important things:
- The training dataset inherits the name of the Feature View.

- The feature store currently supports the following data formats for
training datasets: **tfrecord, csv, tsv, parquet, avro, orc**.

- We can choose the desired format using the **data_format** parameter.

- We can also specify the split ratio using the **splits** parameter.

- **train_split** specifies which split will be used for training.

In [11]:
# The call returns a (version, job) tuple, not a dataframe
td_version, td_job = feature_view.create_training_dataset(
    version = 1,
    description = 'trial_dataset',
    data_format = 'csv',
    splits = {'train': 80, 'validation': 20},
    train_split = "train",
    write_options = {'wait_for_job': False}
)

In [12]:
td_version, td_job

(1, None)
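
To make the meaning of the `splits` dictionary concrete, here is a minimal sketch that treats its values as relative weights and assigns rows to splits at random. It only illustrates the idea of an 80/20 partition: `assign_splits` is a hypothetical helper, and how HSFS actually performs the split is not shown in this notebook.

```python
import random


def assign_splits(rows, splits, seed=42):
    """Randomly assign each row to a split, using the values as relative weights."""
    names = list(splits)
    weights = [splits[name] for name in names]
    rng = random.Random(seed)  # fixed seed so the assignment is reproducible
    result = {name: [] for name in names}
    for row in rows:
        chosen = rng.choices(names, weights=weights, k=1)[0]
        result[chosen].append(row)
    return result


parts = assign_splits(range(1000), {"train": 80, "validation": 20})
print({name: len(rows) for name, rows in parts.items()})
```

Because the assignment is random, the split sizes are only approximately 80% and 20% of the rows.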

To load a training dataset from Hopsworks, we can use the `FeatureView.get_training_dataset_splits()` method.

By specifying the **splits** parameter, we can choose which splits of the training dataset to retrieve.

In [13]:
td_version, df = feature_view.get_training_dataset_splits(
    splits = {},
    start_time = None,
    end_time = None,
    version = 2
)

df.select(['store', 'dept', 'date', 'weekly_sales', 'is_holiday', 'sales_last_month_store_dep']).show(5)

+-----+----+----------+------------+----------+--------------------------+
|store|dept|      date|weekly_sales|is_holiday|sales_last_month_store_dep|
+-----+----+----------+------------+----------+--------------------------+
|    1|   1|2010-03-05|     21827.9|     false|                 131963.08|
|    1|  10|2010-03-05|    33299.27|     false|        119772.36000000002|
|    1|  11|2010-03-05|     19082.9|     false|                  81986.75|
|    1|  12|2010-03-05|    10239.06|     false|                  35284.96|
|    1|  13|2010-03-05|    40423.95|     false|                 153770.69|
+-----+----+----------+------------+----------+--------------------------+
only showing top 5 rows

---

## 🔮 Creating Training Datasets with an Event Time Filter

First, let's import **datetime** from the `datetime` library and define a helper that converts a date string into a timestamp in milliseconds.

Then we can define the `start_time` and `end_time` points.

Finally, we can create a training dataset containing only the data within those time boundaries.

In [14]:
from datetime import datetime

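def timestamp_2_time(x):
    """Convert a 'YYYY-MM-DD' date string to a Unix timestamp in milliseconds."""
    # Parse the date string (naive datetime, interpreted in local time)
    dt_obj = datetime.strptime(x, '%Y-%m-%d')
    # Convert seconds since the epoch to milliseconds
    dt_obj = dt_obj.timestamp() * 1000
    return int(dt_obj)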

In [15]:
start_time = timestamp_2_time('2008-01-01')
end_time = timestamp_2_time('2012-01-01')
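
One caveat with `timestamp_2_time`: calling `timestamp()` on a naive `datetime` interprets it in the machine's local time zone, so the resulting milliseconds can differ between machines. A minimal sketch of a variant that pins the conversion to UTC follows. `date_to_epoch_ms` is a hypothetical name, and whether UTC is the right reference for these event times is an assumption to check against the data.

```python
from datetime import datetime, timezone


def date_to_epoch_ms(date_string):
    """Convert a 'YYYY-MM-DD' string to milliseconds since the Unix epoch, in UTC."""
    # Attach an explicit UTC time zone so the result does not depend on the host.
    dt = datetime.strptime(date_string, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return int(dt.timestamp() * 1000)


print(date_to_epoch_ms("2008-01-01"))  # 1199145600000
print(date_to_epoch_ms("2012-01-01"))  # 1325376000000
```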

In [16]:
exogenous_fg = fs.get_feature_group(
    name = 'exogenous_fg',
    version = 1
)

query = exogenous_fg.select_all()

In [17]:
exogenous_fv = fs.create_feature_view(
    name = 'exogenous_fg_2008_2012',
    version = 1,
    query = query
)

In [18]:
td_version, td_job = exogenous_fv.create_training_dataset(
    description = 'exogenous_fg_filtered',
    version = 1,
    data_format = 'csv',
    write_options = {'wait_for_job': True},
    start_time = start_time,
    end_time = end_time,
    statistics_config={"enabled": False, "histograms": False, "correlations": False, "exact_uniqueness": False}
)
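
Conceptually, `start_time` and `end_time` restrict the training dataset to rows whose event time falls inside the window. The sketch below mimics that with plain Python on two toy rows. The inclusive bounds, the `date` column holding epoch milliseconds and the helper name `filter_by_event_time` are assumptions for illustration, not a description of HSFS internals.

```python
def filter_by_event_time(rows, start_ms, end_ms, time_column="date"):
    """Keep rows whose event time lies in [start_ms, end_ms] (bounds assumed inclusive)."""
    return [row for row in rows if start_ms <= row[time_column] <= end_ms]


# Toy rows with event times in epoch milliseconds (UTC).
rows = [
    {"store": 26, "date": 1265328000000},  # 2010-02-05
    {"store": 26, "date": 1356998400000},  # 2013-01-01
]
window_start = 1199145600000  # 2008-01-01 UTC
window_end = 1325376000000    # 2012-01-01 UTC

selected = filter_by_event_time(rows, window_start, window_end)
print(selected)
```

Only the 2010 row survives the filter, because the 2013 row lies after the end of the window.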

### 🔬 Dataset Retrieval

To retrieve a training dataset from the Feature Store, we can use the `get_training_dataset_splits()` or `get_training_dataset()` methods.

- If no version is provided, a new training dataset version is created.
- If a version is provided and it exists, the training dataset is retrieved and returned as a dataframe.
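
The version rules described above can be pictured with a small in-memory stand-in. This is a toy model only: `ToyTrainingDatasetRegistry` is a hypothetical class, not part of HSFS, and raising an error for a version that does not exist is an assumption of the sketch.

```python
class ToyTrainingDatasetRegistry:
    """In-memory stand-in that mimics the version rules for training datasets."""

    def __init__(self):
        self._versions = {}

    def get(self, build, version=None):
        if version is None:
            # No version given: materialise a new one under the next free number.
            version = max(self._versions, default=0) + 1
            self._versions[version] = build()
        elif version not in self._versions:
            # Behaviour for a missing version is an assumption of this sketch.
            raise KeyError(f"training dataset version {version} does not exist")
        return version, self._versions[version]


registry = ToyTrainingDatasetRegistry()
first = registry.get(lambda: ["row-a", "row-b"])              # creates version 1
again = registry.get(lambda: ["ignored"], version=first[0])   # reads version 1
second = registry.get(lambda: ["row-c"])                      # creates version 2
print(first, again, second)
```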

In [19]:
td_version, df = exogenous_fv.get_training_dataset(
    start_time = start_time,
    end_time = end_time
)

df.head()

Row(store=26, date=1265328000000, temperature=9.55, fuel_price=2.788, markdown1='NA', markdown2='NA', markdown3='NA', markdown4='NA', markdown5='NA', cpi='131.5279032', unemployment='8.488', is_holiday=False)
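
In the row above, `date` comes back as epoch milliseconds, while `cpi`, `unemployment` and the `markdown*` columns are strings, with `'NA'` marking missing values. A small sketch of how such a row could be decoded in plain Python follows. The helper names are hypothetical, and treating the timestamp as UTC and `'NA'` as a missing value are assumptions.

```python
from datetime import datetime, timezone


def epoch_ms_to_date(ms):
    """Convert epoch milliseconds to an ISO date string, interpreting the value as UTC."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc).date().isoformat()


def parse_float(value, missing="NA"):
    """Parse a numeric string, returning None for the missing-value marker."""
    return None if value == missing else float(value)


# Values copied from the row shown above.
print(epoch_ms_to_date(1265328000000))  # 2010-02-05
print(parse_float("131.5279032"))       # 131.5279032
print(parse_float("NA"))                # None
```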

---