### Feature Engineering

In [1]:
from pyspark.sql import functions as F
from pyspark.sql.types import * 
import hsfs

connection = hsfs.connection()
fs = connection.get_feature_store()

Starting Spark application


ID,YARN Application ID,Kind,State,Spark UI,Driver log
7,application_1605601890461_0005,pyspark,idle,Link,Link


SparkSession available as 'spark'.
Connected. Call `.close()` to terminate connection gracefully.

### Brooklyn real estate sales

This example is based on [this](https://www.kaggle.com/tianhwu/brooklynhomes2003to2017?select=brooklyn_sales_map.csv) Kaggle dataset. The goal is to build features for training a model that predicts real estate prices in Brooklyn.

In [2]:
brooklyn_data = spark.read.format("csv").option("header", "true").load("hdfs:///Projects/dataai/Brooklyn/archive/brooklyn_sales_map.csv")

### Data cleaning and one-hot encoding

Let's start by creating the first feature group, which contains features related to the property being sold. We cast the columns to integers or floats so they can be used to train models, and we one-hot encode some categorical features.

In [3]:
apartments_data = brooklyn_data.filter("building_class_category like '%APARTMENTS%'")\
                               .withColumn("age_at_sale", F.col("year_of_sale") - F.col("year_built"))\
                               .withColumn("building_class", F.split(F.col("building_class_category"), " ").getItem(0))\
                               .where("ResArea != 'NA'")\
                               .selectExpr(['cast(building_class as int)', 'cast(residential_units as int)', 'age_at_sale', 'cast(sale_price as float)', 
                                            'cast(SchoolDist as int) as school_dist', 'cast(PolicePrct as int) as police_prct', 'cast(HealthArea as int) as health_area', 
                                            'cast(ResArea as int) as res_area', 'cast(GarageArea as int) as garage_area', 'OwnerType as owner_type'])

In [4]:
apartments_data.printSchema()

root
 |-- building_class: integer (nullable = true)
 |-- residential_units: integer (nullable = true)
 |-- age_at_sale: double (nullable = true)
 |-- sale_price: float (nullable = true)
 |-- school_dist: integer (nullable = true)
 |-- police_prct: integer (nullable = true)
 |-- health_area: integer (nullable = true)
 |-- res_area: integer (nullable = true)
 |-- garage_area: integer (nullable = true)
 |-- owner_type: string (nullable = true)
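
Every column in the schema is nullable, and that matters for the casts above. Under Spark's non-ANSI cast semantics (the default in Spark 2.x and 3.x), casting a string that cannot be parsed, such as `'NA'`, yields null instead of raising an error. Only `ResArea` is filtered for `'NA'`, so other columns may still contain nulls after the cast. The plain-Python sketch below imitates that behavior on made-up rows; `cast_int_or_none` is a hypothetical helper, not a Spark API, and it does not reproduce every detail of Spark's string parsing.

```python
from typing import Optional


def cast_int_or_none(value: Optional[str]) -> Optional[int]:
    """Hypothetical helper: parse an integer string, returning None on failure."""
    if value is None:
        return None
    try:
        return int(value)
    except ValueError:
        return None


# Made-up rows for illustration only, not taken from the dataset.
raw_rows = [
    {"ResArea": "1200", "GarageArea": "0"},
    {"ResArea": "NA", "GarageArea": "NA"},
    {"ResArea": "850", "GarageArea": "NA"},
]

# Mirror .where("ResArea != 'NA'") followed by the integer casts.
kept = [row for row in raw_rows if row["ResArea"] != "NA"]
casted = [
    {
        "res_area": cast_int_or_none(row["ResArea"]),
        "garage_area": cast_int_or_none(row["GarageArea"]),
    }
    for row in kept
]

print(casted)
# [{'res_area': 1200, 'garage_area': 0}, {'res_area': 850, 'garage_area': None}]
```

The third row survives the filter but still carries a null `garage_area`, which is worth keeping in mind when deriving features from that column.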

In [5]:
apartments_data_fg = apartments_data\
               .withColumn("property_id", F.monotonically_increasing_id())\
               .withColumn("is_owner_private", F.when(F.col("owner_type") == "P", 1).otherwise(0))\
               .withColumn("is_owner_company", F.when(F.col("owner_type") == "C", 1).otherwise(0))\
               .withColumn("is_owner_organization", F.when(F.col("owner_type") == "O", 1).otherwise(0))\
               .withColumn("is_single_unit", F.when(F.col("residential_units") == 1, 1).otherwise(0))\
               .withColumn("is_large_residential", F.when(F.col("residential_units") >= 100, 1).otherwise(0))\
               .withColumn("has_garage_area", F.when(F.col("garage_area") > 0, 1).otherwise(0))  # an area of 0 means no garage
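
The `F.when(...).otherwise(0)` pattern used for the owner flags is a hand-written one-hot encoding. As a minimal sketch in plain Python (no Spark required), the same logic looks like this. The owner codes and column names come from the cell above; the helper `one_hot_owner` is hypothetical. A missing or unlisted `owner_type` produces all zeros, because a condition that is null or false falls through to `otherwise(0)`.

```python
from typing import Dict, Optional

# Owner codes and output column names taken from the cell above.
OWNER_COLUMNS = {
    "P": "is_owner_private",
    "C": "is_owner_company",
    "O": "is_owner_organization",
}


def one_hot_owner(owner_type: Optional[str]) -> Dict[str, int]:
    """Hypothetical helper: 1 for the matching owner code, 0 everywhere else."""
    return {column: int(owner_type == code) for code, column in OWNER_COLUMNS.items()}


print(one_hot_owner("C"))
# {'is_owner_private': 0, 'is_owner_company': 1, 'is_owner_organization': 0}
print(one_hot_owner(None))
# {'is_owner_private': 0, 'is_owner_company': 0, 'is_owner_organization': 0}
```

Note also that `F.monotonically_increasing_id()` guarantees unique, increasing ids but not consecutive ones, and the values depend on how the data is partitioned.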

The next cell creates the first feature group, `real_estate`. We are going to use it for online serving, so we enable it for the online feature store, and we also configure statistics for it.

As explained in the [documentation](https://docs.hopsworks.ai), the first step creates the metadata object, while the second step saves the feature data in the feature store.

In [6]:
aprmt_fg_meta = fs.create_feature_group("real_estate",
                        version=1,
                        description="Real estate features",
                        primary_key=['property_id'],
                        statistics_config={'histograms': True, 'correlations': True},
                        time_travel_format=None,
                        online_enabled=True)

aprmt_fg_meta.save(apartments_data_fg)

<hsfs.feature_group.FeatureGroup object at 0x7fa6d1d13710>

### Add contextual features

To make the example more interesting and realistic, we compute several other feature groups that contain contextual information. For example, we look at the average sale price for a given school district, health area, and police precinct.

We keep these features in separate feature groups to simulate multiple feature engineering pipelines writing to the Hopsworks Feature Store.
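
The contextual features are grouped averages. As a rough sketch of what a `groupBy(...).agg({'sale_price': 'avg'})` computes, here is the same calculation in plain Python on made-up sales. The helper `average_by` and the sample rows are hypothetical; like Spark's `avg`, the sketch skips null values.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Optional

# Made-up rows for illustration only, not taken from the dataset.
sales: List[Dict[str, Optional[float]]] = [
    {"school_dist": 13, "sale_price": 500000.0},
    {"school_dist": 13, "sale_price": 700000.0},
    {"school_dist": 13, "sale_price": None},
    {"school_dist": 15, "sale_price": 900000.0},
]


def average_by(rows, key: str, value: str) -> Dict[Hashable, float]:
    """Hypothetical helper: average `value` per distinct `key`, ignoring None."""
    totals = defaultdict(lambda: [0.0, 0])  # key -> [running sum, count]
    for row in rows:
        if row[value] is None:
            continue
        totals[row[key]][0] += row[value]
        totals[row[key]][1] += 1
    return {k: total / count for k, (total, count) in totals.items()}


school_dist_avg = average_by(sales, "school_dist", "sale_price")
print(school_dist_avg)
# {13: 600000.0, 15: 900000.0}
```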

In [7]:
school_dist_avg_prices = apartments_data.groupBy("school_dist").agg({'sale_price': 'avg'}).withColumnRenamed("avg(sale_price)", "school_dist_avg_sale_price")
school_dist_avg_prices_owner = apartments_data.groupBy(["school_dist", "owner_type"]).agg({'sale_price': 'avg'}).withColumnRenamed("avg(sale_price)", "school_dist_avg_sale_price_owner")
school_dist_avg_prices_cat = apartments_data.groupBy(["school_dist", "building_class"]).agg({'sale_price': 'avg'}).withColumnRenamed("avg(sale_price)", "school_dist_avg_sale_price_cat")

school_dist_fg = school_dist_avg_prices.join(school_dist_avg_prices_owner, "school_dist").join(school_dist_avg_prices_cat, "school_dist")
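
Because all three aggregates are joined on `school_dist` alone, the result has one row for every combination of `owner_type` and `building_class` seen within a district, including combinations that may never occur together in a single sale. The plain-Python sketch below illustrates this fan-out on made-up aggregates; `inner_join` is a hypothetical helper standing in for Spark's default inner join.

```python
from typing import Dict, List

# Made-up aggregates for a single district, for illustration only.
avg_by_dist = [{"school_dist": 13, "avg_price": 600000.0}]
avg_by_owner = [
    {"school_dist": 13, "owner_type": "P", "avg_price_owner": 550000.0},
    {"school_dist": 13, "owner_type": "C", "avg_price_owner": 800000.0},
]
avg_by_class = [
    {"school_dist": 13, "building_class": 7, "avg_price_cat": 500000.0},
    {"school_dist": 13, "building_class": 8, "avg_price_cat": 650000.0},
    {"school_dist": 13, "building_class": 9, "avg_price_cat": 700000.0},
]


def inner_join(left: List[Dict], right: List[Dict], key: str) -> List[Dict]:
    """Hypothetical helper: naive inner join of two row lists on one key."""
    return [{**l, **r} for l in left for r in right if l[key] == r[key]]


joined = inner_join(
    inner_join(avg_by_dist, avg_by_owner, "school_dist"),
    avg_by_class,
    "school_dist",
)

# 1 district x 2 owner types x 3 building classes
print(len(joined))  # 6

# Each row is identified by the three-column key.
keys = {(r["school_dist"], r["owner_type"], r["building_class"]) for r in joined}
print(len(keys) == len(joined))  # True
```

This is why a row in the joined data can only be identified by `school_dist`, `owner_type`, and `building_class` together.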

In [8]:
school_dist_fg_meta = fs.create_feature_group("school_dist",
                        version=1,
                        description="School district features",
                        primary_key=['school_dist', 'owner_type', 'building_class'],
                        statistics_config={'histograms': True, 'correlations': True},
                        time_travel_format=None,
                        online_enabled=True)

school_dist_fg_meta.save(school_dist_fg)

<hsfs.feature_group.FeatureGroup object at 0x7fa68b547390>

In [9]:
health_area_avg_prices = apartments_data.groupBy("health_area").agg({'sale_price': 'avg'}).withColumnRenamed("avg(sale_price)", "health_area_avg_sale_price")
health_area_avg_prices_owner = apartments_data.groupBy(["health_area", "owner_type"]).agg({'sale_price': 'avg'}).withColumnRenamed("avg(sale_price)", "health_area_avg_sale_price_owner")
health_area_avg_prices_cat = apartments_data.groupBy(["health_area", "building_class"]).agg({'sale_price': 'avg'}).withColumnRenamed("avg(sale_price)", "health_area_avg_sale_price_cat")

health_area_fg = health_area_avg_prices.join(health_area_avg_prices_owner, "health_area").join(health_area_avg_prices_cat, "health_area")

In [10]:
health_area_fg_meta = fs.create_feature_group("health_area",
                        version=1,
                        description="Health area features",
                        primary_key=['health_area', 'owner_type', 'building_class'],
                        statistics_config={'histograms': True, 'correlations': True},
                        time_travel_format=None,
                        online_enabled=True)

health_area_fg_meta.save(health_area_fg)

<hsfs.feature_group.FeatureGroup object at 0x7fa68b0c3cd0>

In [11]:
police_prct_avg_prices = apartments_data.groupBy("police_prct").agg({'sale_price': 'avg'}).withColumnRenamed("avg(sale_price)", "police_prct_avg_sale_price")
police_prct_avg_prices_owner = apartments_data.groupBy(["police_prct", "owner_type"]).agg({'sale_price': 'avg'}).withColumnRenamed("avg(sale_price)", "police_prct_avg_sale_price_owner")
police_prct_avg_prices_cat = apartments_data.groupBy(["police_prct", "building_class"]).agg({'sale_price': 'avg'}).withColumnRenamed("avg(sale_price)", "police_prct_avg_sale_price_cat")

police_prct_fg = police_prct_avg_prices.join(police_prct_avg_prices_owner, "police_prct").join(police_prct_avg_prices_cat, "police_prct")

In [12]:
police_prct_fg_meta = fs.create_feature_group("police_prct",
                        version=1,
                        description="Police precinct features",
                        primary_key=['police_prct', 'owner_type', 'building_class'],
                        statistics_config={'histograms': True, 'correlations': True},
                        time_travel_format=None,
                        online_enabled=True)

police_prct_fg_meta.save(police_prct_fg)

<hsfs.feature_group.FeatureGroup object at 0x7fa68b035e90>