### Training Dataset

When building a model, data scientists go through a phase of data exploration.

In [1]:
from pyspark.sql import functions as F
from pyspark.sql.types import * 
import hsfs

connection = hsfs.connection()
fs = connection.get_feature_store()

Starting Spark application


ID,YARN Application ID,Kind,State,Spark UI,Driver log
8,application_1605601890461_0006,pyspark,idle,Link,Link


SparkSession available as 'spark'.
Connected. Call `.close()` to terminate connection gracefully.

The first step is to retrieve all the feature groups we are interested in.

In [4]:
real_estate_fg = fs.get_feature_group("real_estate", version=1)
health_area_fg = fs.get_feature_group("health_area", version=1)
police_prct_fg = fs.get_feature_group("police_prct", version=1)
school_dist_fg = fs.get_feature_group("school_dist", version=1)

For the sake of the example, we are not interested in all the features we computed in the previous notebook. In the next cell we select only specific ones, either by reading the feature group metadata or by listing them explicitly.

In [9]:
real_estate_features = [f.name for f in real_estate_fg.features]
real_estate_features.remove("property_id")
real_estate_features.remove("owner_type")

police_prct_features = ['police_prct_avg_sale_price', 'police_prct_avg_sale_price_owner', 'police_prct_sale_price_cat']
health_area_features = ['health_area_avg_sale_price', 'health_area_avg_sale_price_owner', 'health_area_avg_sale_price_cat']
school_dist_features = ['school_dist_avg_sale_price', 'school_dist_avg_sale_price_owner', 'school_dist_avg_sale_price_cat']
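
The metadata-based selection above calls `list.remove`, which raises `ValueError` if a name is not present. A minimal sketch of the same idea with an exclusion set follows. `Feature`, `select_feature_names`, and the sample names are assumptions made for illustration: they stand in for the objects returned by `real_estate_fg.features` and are not HSFS classes.

```python
from collections import namedtuple

# Hypothetical stand-in for feature metadata objects; only `.name` is used.
Feature = namedtuple("Feature", ["name"])


def select_feature_names(features, exclude=()):
    """Return feature names in their original order, skipping those in `exclude`."""
    excluded = set(exclude)
    return [f.name for f in features if f.name not in excluded]


# Sample metadata, invented for illustration only.
sample = [
    Feature("property_id"),
    Feature("sale_price"),
    Feature("owner_type"),
    Feature("res_area"),
]
selected = select_feature_names(sample, exclude=["property_id", "owner_type"])
print(selected)
```

Names that are not in the list are simply ignored, so the same call keeps working if the feature group schema changes.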

The key component of building a training dataset is joining feature groups. Hopsworks provides a Pandas-like API that allows you to join several feature groups together. A basic example would be the following:

```
td_features = real_estate_fg.select(real_estate_features)\
              .join(health_area_fg.select(health_area_features))
```

In this case we select only the features we are interested in from the real_estate feature group and from the health_area feature group. We let the query constructor in Hopsworks determine how to join the feature groups and which features to use as joining keys.


The next cell is a bit more complex: we override the joining keys with our own user-provided ones.

In [6]:
td_features = real_estate_fg.select(real_estate_features)\
              .join(health_area_fg.select(health_area_features), on=['health_area', 'owner_type', 'building_class'])\
              .join(police_prct_fg.select(police_prct_features), on=['police_prct', 'owner_type', 'building_class'])\
              .join(school_dist_fg.select(school_dist_features), on=['school_dist', 'owner_type', 'building_class'])
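
To make the effect of the `on=[...]` keys concrete, here is a minimal sketch of an inner join on a composite key, written in plain Python over lists of dicts. It only mirrors the semantics of the `INNER JOIN` that appears in the generated query printed in this notebook; it is not how Hopsworks executes the join. `inner_join` and the toy rows are invented for illustration.

```python
def inner_join(left_rows, right_rows, on):
    """Inner-join two lists of dicts on the column names listed in `on`."""
    # Index the right side by its composite key for constant-time lookup.
    index = {}
    for row in right_rows:
        index.setdefault(tuple(row[k] for k in on), []).append(row)

    joined = []
    for left in left_rows:
        # Rows with no match on the right side are dropped (inner join).
        for right in index.get(tuple(left[k] for k in on), []):
            merged = dict(left)
            merged.update({k: v for k, v in right.items() if k not in on})
            joined.append(merged)
    return joined


# Toy rows, invented for illustration only.
real_estate = [
    {"health_area": 10, "owner_type": "private", "building_class": "A1", "sale_price": 500000},
    {"health_area": 20, "owner_type": "company", "building_class": "B2", "sale_price": 900000},
]
health_area = [
    {"health_area": 10, "owner_type": "private", "building_class": "A1",
     "health_area_avg_sale_price": 480000.0},
]

rows = inner_join(real_estate, health_area,
                  on=["health_area", "owner_type", "building_class"])
print(rows)
```

Only the first property survives, because the second has no row in the toy `health_area` data with the same three key values. The same thing happens in a real inner join: rows without a match in every joined feature group do not reach the training dataset.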

Joining feature groups generates a SQL query that will be run on Spark. You can inspect the query for debugging:

In [10]:
print(td_features.to_string())

SELECT `fg0`.`building_class`, `fg0`.`sale_price`, `fg0`.`is_owner_company`, `fg0`.`residential_units`, `fg0`.`age_at_sale`, `fg0`.`res_area`, `fg0`.`is_owner_organization`, `fg0`.`police_prct`, `fg0`.`garage_area`, `fg0`.`is_owner_private`, `fg0`.`is_large_residential`, `fg0`.`school_dist`, `fg0`.`health_area`, `fg0`.`is_single_unit`, `fg0`.`has_garage_area`, `fg1`.`health_area_avg_sale_price`, `fg1`.`health_area_avg_sale_price_owner`, `fg1`.`health_area_avg_sale_price_cat`, `fg2`.`police_prct_avg_sale_price`, `fg2`.`police_prct_avg_sale_price_owner`, `fg2`.`police_prct_sale_price_cat`, `fg3`.`school_dist_avg_sale_price`, `fg3`.`school_dist_avg_sale_price_owner`, `fg3`.`school_dist_avg_sale_price_cat`
FROM `dataai_featurestore`.`real_estate_1` `fg0`
INNER JOIN `dataai_featurestore`.`health_area_1` `fg1` ON `fg0`.`health_area` = `fg1`.`health_area` AND `fg0`.`owner_type` = `fg1`.`owner_type` AND `fg0`.`building_class` = `fg1`.`building_class`
INNER JOIN `dataai_featurestore`.`police_pr
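
When reading the generated query, it helps to see how a list of join keys maps to an `ON` condition. The sketch below rebuilds the condition of the first `INNER JOIN` printed above from its three keys. `on_clause` is a hypothetical helper written for this explanation and is not part of the HSFS API.

```python
def on_clause(left_alias, right_alias, keys):
    """Build a backtick-quoted ON condition that equates each key across two aliases."""
    return " AND ".join(
        f"`{left_alias}`.`{key}` = `{right_alias}`.`{key}`" for key in keys
    )


clause = on_clause("fg0", "fg1", ["health_area", "owner_type", "building_class"])
print(clause)
```

Comparing this string with the printed query is a quick way to check that the keys passed to `join` are the ones actually used.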

Finally, you can use the query to create a training dataset. Creating a training dataset is similar to creating a feature group. For training datasets you can specify a label (the target feature), and you can divide the dataset into separate splits for training, testing, and validation.

In [8]:
td = fs.create_training_dataset("real_estate_price",
                                version=1,
                                data_format="csv",
                                label=['sale_price'],
                                description="A dataset to train a real estate property value prediction",
                                statistics_config={'histograms': True, 'correlations': False},
                                splits={'train': 0.7, 'test': 0.2, 'validate': 0.1})

td.save(td_features)

<hsfs.training_dataset.TrainingDataset object at 0x7f366a9f6950>
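
The `splits` argument above assigns a proportion of the rows to each named split. As a rough illustration of what such proportions mean, the sketch below shuffles a list of rows and cuts it by the same ratios. It is a plain-Python approximation under the assumption of a simple random split, and not necessarily how Hopsworks performs the split. `split_rows` and its `seed` parameter are hypothetical.

```python
import random


def split_rows(rows, splits, seed=0):
    """Shuffle `rows` and cut them into named splits by proportion."""
    if abs(sum(splits.values()) - 1.0) > 1e-9:
        raise ValueError("split proportions must sum to 1")

    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility

    result = {}
    names = list(splits)
    start = 0
    for i, name in enumerate(names):
        if i == len(names) - 1:
            # The last split takes the remainder so no row is lost to rounding.
            end = len(shuffled)
        else:
            end = start + round(splits[name] * len(shuffled))
        result[name] = shuffled[start:end]
        start = end
    return result


parts = split_rows(range(100), {"train": 0.7, "test": 0.2, "validate": 0.1})
print({name: len(part) for name, part in parts.items()})
```

Each row lands in exactly one split, so the test and validation sets never overlap with the training set.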