In [1]:
from pyspark.sql.functions import *
from pyspark.ml import *
from pyspark.ml.feature import *

# Feature engineering / Pipeline building
This notebook walks through each engineered feature and then through building the pipeline.

# Initializing Spark

In [2]:
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.config("spark.driver.memory", "2g").getOrCreate()
)

# Defining a schema and loading the data
* You can check the input schema definition in [house_price_predictor/schema.py](house_price_predictor/schema.py).
* I use Doubles for all features, rather than Integers for some of them, because a single numeric type is more flexible and keeps the pipeline from erroring out on noisy values.
* Even though `waterfront` could be a boolean, I treat it as numerical because that makes imputation easier.
* The same applies to the other categorical features (e.g., `condition` or `grade`), because the data visualization made it clear that their order is a predictor of value.

In [3]:
from pyspark.sql.types import *
from house_price_predictor.schema import schema
df = spark.read.csv("data/kc_house_data.csv", header=True, schema=schema, mode="PERMISSIVE")

# Creating new features
* For the source code check the files [house_price_predictor/features/sql_transformers.py](house_price_predictor/features/sql_transformers.py) and [house_price_predictor/features/zipcode_ranker.py](house_price_predictor/features/zipcode_ranker.py).

## Convert date
* SQLTransformers are the most straightforward way of creating serializable Spark Pipelines, so I will use them as much as I can.

In [4]:
from house_price_predictor.features.sql_transformers import *

colname, transformer = date_converter()
df = transformer.transform(df)
df.select("date", colname).show(5)

+---------------+-------------------+
|           date|     converted_date|
+---------------+-------------------+
|20141013T000000|2014-10-13 00:00:00|
|20141209T000000|2014-12-09 00:00:00|
|20150225T000000|2015-02-25 00:00:00|
|20141209T000000|2014-12-09 00:00:00|
|20150218T000000|2015-02-18 00:00:00|
+---------------+-------------------+
only showing top 5 rows
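
The conversion shown above can be mirrored in plain Python to make the raw format explicit. This is only a sketch: the source of `date_converter()` is not reproduced here, and `convert_date` and `RAW_DATE_FORMAT` are hypothetical names. In Spark SQL the same parsing would presumably rely on something like `to_timestamp(date, "yyyyMMdd'T'HHmmss")`, but that is an assumption about the transformer, not a quote from it.

```python
from datetime import datetime

# Raw layout of the `date` column, e.g. "20141013T000000" (illustrative constant).
RAW_DATE_FORMAT = "%Y%m%dT%H%M%S"


def convert_date(raw: str) -> datetime:
    """Parse a raw sale date string into a datetime (hypothetical helper)."""
    return datetime.strptime(raw, RAW_DATE_FORMAT)


converted = convert_date("20141013T000000")
print(converted)  # 2014-10-13 00:00:00
```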



## Age of construction feature
* This one probably doesn't bring much value, but it's useful to compare with `renovation_age` below

In [5]:
colname, transformer = construction_age_creator()
df = transformer.transform(df)
df.select("date", "yr_built", colname).show(5)

+---------------+--------+----------------+
|           date|yr_built|construction_age|
+---------------+--------+----------------+
|20141013T000000|    1955|              59|
|20141209T000000|    1951|              63|
|20150225T000000|    1933|              82|
|20141209T000000|    1965|              49|
|20150218T000000|    1987|              28|
+---------------+--------+----------------+
only showing top 5 rows



## Age of renovation
* As mentioned in the [data viz notebook](00_data_viz.ipynb), the `yr_renovated` column is imputed with 0s, so I am creating a `renovation_age` feature to replace it.
* When `yr_renovated == 0`, `renovation_age` falls back to the construction age, which is also the maximum value it makes sense for it to take.
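
The fallback rule can be sketched in plain Python as follows. The actual features are built with SQLTransformers whose statements are not shown in this notebook, so the function names and the use of the sale year as the reference point are assumptions, chosen to be consistent with the rows displayed in this notebook.

```python
def construction_age(sale_year: int, yr_built: int) -> int:
    """Years between construction and sale (hypothetical helper)."""
    return sale_year - yr_built


def renovation_age(sale_year: int, yr_built: int, yr_renovated: int) -> int:
    """Years since the last renovation; falls back to the construction age
    when yr_renovated is the 0 placeholder (hypothetical helper)."""
    if yr_renovated == 0:
        return construction_age(sale_year, yr_built)
    return sale_year - yr_renovated


# (sale_year, yr_built, yr_renovated) taken from rows displayed in this notebook
print(renovation_age(2014, 1955, 0))     # 59
print(renovation_age(2014, 1951, 1991))  # 23
```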

In [6]:
colname, transformer = renovation_age_creator()
df = transformer.transform(df)
df.select("date", "yr_built", "yr_renovated", "construction_age", colname).show(5)
df.select("date", "yr_built", "yr_renovated", "construction_age", colname).where(col("yr_renovated") != 0).show(5)

+---------------+--------+------------+----------------+--------------+
|           date|yr_built|yr_renovated|construction_age|renovation_age|
+---------------+--------+------------+----------------+--------------+
|20141013T000000|    1955|           0|              59|            59|
|20141209T000000|    1951|        1991|              63|            23|
|20150225T000000|    1933|           0|              82|            82|
|20141209T000000|    1965|           0|              49|            49|
|20150218T000000|    1987|           0|              28|            28|
+---------------+--------+------------+----------------+--------------+
only showing top 5 rows

+---------------+--------+------------+----------------+--------------+
|           date|yr_built|yr_renovated|construction_age|renovation_age|
+---------------+--------+------------+----------------+--------------+
|20141209T000000|    1951|        1991|              63|            23|
|20140613T000000|    1930|        2002|

## Zipcode

* The `zipcode_price_rank` feature is the position of a given zipcode in the ranking of zipcodes by average price, with 0 being the most expensive zipcode.
* You can check the code in this file [house_price_predictor/features/zipcode_ranker.py](house_price_predictor/features/zipcode_ranker.py)
* One concern here is label leakage. I avoid it by fitting the ranker as part of a Pipeline, so it is only ever fitted on the train dataset.
* This approach is much lighter than OneHotEncoding of the zipcodes.
* Using the SQLTransformer was not an option because it would require SQL UDFs and these cannot be persisted.
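
The fit/transform split that prevents leakage can be illustrated with a plain-Python sketch. This is not the `ZipcodeRanker` implementation: `fit_zipcode_ranks`, the toy zipcode labels, and the prices are all hypothetical, and returning `None` for a zipcode absent from the training data is an assumption made here for illustration.

```python
from collections import defaultdict


def fit_zipcode_ranks(rows):
    """Learn {zipcode: rank} from (zipcode, price) pairs.

    Rank 0 is the zipcode with the highest mean price.
    """
    totals = defaultdict(lambda: [0.0, 0])  # zipcode -> [price sum, row count]
    for zipcode, price in rows:
        totals[zipcode][0] += price
        totals[zipcode][1] += 1
    mean_price = {z: total / count for z, (total, count) in totals.items()}
    ordered = sorted(mean_price, key=mean_price.get, reverse=True)
    return {zipcode: rank for rank, zipcode in enumerate(ordered)}


# Toy training rows (made-up labels and prices).
train = [("A", 2_000_000.0), ("A", 1_800_000.0), ("B", 1_300_000.0),
         ("C", 230_000.0), ("C", 250_000.0)]
ranks = fit_zipcode_ranks(train)  # fitted on the training rows only

# Applying the learned mapping to other rows never looks at their prices.
test_zipcodes = ["C", "A", "Z"]
print([ranks.get(z) for z in test_zipcodes])  # [2, 0, None]
```

Because the mapping is learned once and then only looked up, prices from the evaluation data never influence the rank.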

In [7]:
from house_price_predictor.features.zipcode_ranker import ZipcodeRanker
zipcode_ranker = ZipcodeRanker().fit(df)
df = zipcode_ranker.transform(df)
df.select("zipcode", "zipcode_price_rank").show(10)

+-------+------------------+
|zipcode|zipcode_price_rank|
+-------+------------------+
|  98178|                57|
|  98125|                37|
|  98028|                39|
|  98136|                31|
|  98074|                13|
|  98053|                15|
|  98003|                62|
|  98198|                59|
|  98146|                50|
|  98038|                49|
+-------+------------------+
only showing top 10 rows



# Building the pipeline
* See the source code in [house_price_predictor/features/pipeline.py](house_price_predictor/features/pipeline.py)
* An Imputer fills missing values with the column mean, because mean-filled values should have the least impact on the model.
* All features are scaled, because PCA is sensitive to the scale of its inputs.
* I am using standard scaling because, unlike min-max scaling, it is not pinned to the extreme values, so outliers do not squash the rest of the data into a narrow range.
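
A small plain-Python sketch shows why mean imputation combines well with standard scaling. It is not the pipeline code: `impute_mean` and `standardize` are hypothetical helpers, the input values are made up, and centering on the mean with the sample standard deviation is an assumption about how the scaler is configured.

```python
from statistics import mean, stdev


def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    fill = mean(v for v in values if v is not None)
    return [fill if v is None else v for v in values]


def standardize(values):
    """Center on the mean and divide by the sample standard deviation."""
    mu = mean(values)
    sigma = stdev(values)
    return [(v - mu) / sigma for v in values]


raw = [2.0, None, 4.0, 6.0]   # one missing value (toy data)
imputed = impute_mean(raw)    # [2.0, 4.0, 4.0, 6.0]
scaled = standardize(imputed)
print(scaled[1])              # 0.0
```

Under these assumptions the imputed entry lands exactly at 0 after scaling, i.e. at the center of the feature, which is the sense in which it affects the model the least.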


In [8]:
df = spark.read.csv("data/kc_house_data.csv", header=True, schema=schema, mode="PERMISSIVE")

In [9]:
from house_price_predictor.features.pipeline import create_fitted_pipeline
pipeline = create_fitted_pipeline(dataset=df)


* Since the features are highly correlated with one another, I am using PCA to avoid further feature selection and to speed up model training. Summing the values of the vector of explained variances shows that the first 13 principal components account for 95.7% of the variance. That's very reasonable, so 13 is the value of `k` we'll be using.

In [10]:
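# Stage 7 of the fitted pipeline is the PCA model
pca = pipeline.stages[7]
# Print the cumulative explained variance after each principal component
cumulative = 0
for idx, expl_var in enumerate(pca.explainedVariance):
    cumulative += expl_var
    print(f"PC {idx + 1}: {cumulative}")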

PC 1: 0.31292743366538406
PC 2: 0.4451170781945967
PC 3: 0.5505288157224567
PC 4: 0.6256772364262979
PC 5: 0.6934923788496593
PC 6: 0.7459239834052643
PC 7: 0.7933722400006185
PC 8: 0.8324624430810724
PC 9: 0.868835054647886
PC 10: 0.8965484493427118
PC 11: 0.9198708947881981
PC 12: 0.9412332098496586
PC 13: 0.9570944891559053
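
The choice of `k` can also be made programmatically. The sketch below is a plain-Python illustration with a hypothetical helper, `smallest_k`; the per-component ratios are the differences between consecutive cumulative values printed above, rounded to four decimals, so they are approximations rather than the exact `explainedVariance` entries.

```python
from itertools import accumulate


def smallest_k(explained_variance, threshold):
    """Return the smallest number of components whose cumulative
    explained variance reaches the threshold."""
    for k, cumulative in enumerate(accumulate(explained_variance), start=1):
        if cumulative >= threshold:
            return k
    raise ValueError("threshold not reached by the available components")


# Approximate per-component ratios derived from the cumulative output above.
ratios = [0.3129, 0.1322, 0.1054, 0.0751, 0.0678, 0.0524, 0.0474,
          0.0391, 0.0364, 0.0277, 0.0233, 0.0214, 0.0159]
print(smallest_k(ratios, 0.95))  # 13
```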


* This is an example of the final feature array of a single row (with which a model will be trained).

In [11]:
pipeline.transform(df).select("features").toPandas().iloc[0,0]


DenseVector([2.6959, -0.3085, 0.3625, -0.7628, 0.351, 0.2867, -0.321, 0.9568, -0.0614, -0.1224, -1.2589, 0.0829, 0.1405])