# Predicting House Prices

This notebook will take you through the process of setting up a workflow that featurizes a relatively complex [housing price dataset](https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques), creating 4 models th predict housing prices, and ensembling the results by taking the average of the 4 models. If you're curious, the original Kaggle competition has a full description of the dataset. 

**You can find and download this notebook on GitHub [here](https://github.com/aqueducthq/aqueduct/blob/main/examples/house-price-prediction/House%20Price%20Prediciton.ipynb).**

The credit for all the feature engineering that's done here goes to use Serigne on Kaggle, who put together this wonderful [notebook](https://www.kaggle.com/code/serigne/stacked-regressions-top-4-on-leaderboard/notebook) for this competition. 

**Throughout this notebook, you'll see a decorator (`@aq.op`) above functions. This decorator allows Aqueduct to run your functions as a part of a workflow automatically.**

In [1]:
import aqueduct as aq
import pandas as pd
import numpy as np

# If you're running your notebook on a separate machine from your
# Aqueduct server, change this to the address of your Aqueduct server.
address = "http://localhost:8080"

# If you're running youre notebook on a separate machine from your
# Aqueduct server, you will have to copy your API key here rather than
# using `get_apikey()`.
api_key = aq.get_apikey()
client = aq.Client(api_key, address)

In [2]:
# First we'll load in our training data from a CSV file that's stored on
# the Aqueduct GitHub repo.
train_data = pd.read_csv("data/train.csv")
train_data

Unnamed: 0_level_0,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
Id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,Inside,...,0,,,,0,2,2008,WD,Normal,208500
2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,FR2,...,0,,,,0,5,2007,WD,Normal,181500
3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,Inside,...,0,,,,0,9,2008,WD,Normal,223500
4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,Corner,...,0,,,,0,2,2006,WD,Abnorml,140000
5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,FR2,...,0,,,,0,12,2008,WD,Normal,250000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1456,60,RL,62.0,7917,Pave,,Reg,Lvl,AllPub,Inside,...,0,,,,0,8,2007,WD,Normal,175000
1457,20,RL,85.0,13175,Pave,,Reg,Lvl,AllPub,Inside,...,0,,MnPrv,,0,2,2010,WD,Normal,210000
1458,70,RL,66.0,9042,Pave,,Reg,Lvl,AllPub,Inside,...,0,,GdPrv,Shed,2500,5,2010,WD,Normal,266500
1459,20,RL,68.0,9717,Pave,,Reg,Lvl,AllPub,Inside,...,0,,,,0,4,2010,WD,Normal,142125


There's a whole bunch of columns here describing the housing data, everything from the zoning of the area, the type of lot, the utilities available to the house, whether it has a basement or a pool, and its square footage. We're not going to describe all of the data here, but you can check out the original [Kaggle competition](https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques) if you're curious. 

With all this data, we're naturally going to need to do some data cleaning. The first thing we're going to do is fill in any missing values with reasonable defaults in the `fill_missing_data` function below. Some of our columns will be filled with a categorical `None`, some will be filled with a 0, and others will be filled with the modal or median value of the column. There's a lot of data here, so we're not going to dive into exactly why, but if you're curious, check out the notebook written by Serigne that explains why [here](https://www.kaggle.com/code/serigne/stacked-regressions-top-4-on-leaderboard/data).

In [3]:
# The @op decorator here allows Aqueduct to run this function as
# a part of an Aqueduct workflow. It tells Aqueduct that when
# we execute this function, we're defining a step in the workflow.
# While the results can be retrieved immediately, nothing is
# published until we call `publish_flow()` below.
@aq.op
def fill_missing_data(raw_house_data):
    """
    This function fills in any missing data from our housing dataset. Depending on the
    type of columns (categorical or numerical) and it's distributional properties (if numerical)
    we either fill missing values with None, with a zero, or with the modal value of the
    column. In addition, we:
    - drop the Utilities column because it is same for all but one of the houses
    - drop the SalePrice column because it's our prediction target
    - calculate the total square footage of the house by summing the basement and 1st and 2nd floors' SF
    """
    if "SalePrice" in raw_house_data:
        raw_house_data = raw_house_data.drop(["SalePrice"], axis=1)

    none_cols = [
        "PoolQC",
        "MiscFeature",
        "Alley",
        "Fence",
        "FireplaceQu",
        "GarageType",
        "GarageFinish",
        "GarageQual",
        "GarageCond",
        "BsmtQual",
        "BsmtCond",
        "BsmtExposure",
        "BsmtFinType1",
        "BsmtFinType2",
        "MasVnrType",
        "MSSubClass",
    ]
    for col in none_cols:
        raw_house_data[col].fillna("None", inplace=True)

    zero_cols = [
        "GarageYrBlt",
        "GarageArea",
        "GarageCars",
        "BsmtFinSF1",
        "BsmtFinSF2",
        "BsmtUnfSF",
        "TotalBsmtSF",
        "BsmtFullBath",
        "BsmtHalfBath",
        "MasVnrArea",
    ]
    for col in zero_cols:
        raw_house_data[col].fillna(0, inplace=True)

    modal_cols = ["MSZoning", "Electrical", "KitchenQual", "Exterior1st", "Exterior2nd", "SaleType"]

    for col in modal_cols:
        raw_house_data[col].fillna(raw_house_data[col].mode()[0], inplace=True)

    raw_house_data.drop(["Utilities"], axis=1, inplace=True)
    raw_house_data["Functional"].fillna("Typ", inplace=True)

    raw_house_data["LotFrontage"] = raw_house_data.groupby("Neighborhood")["LotFrontage"].transform(
        lambda x: x.fillna(x.median())
    )

    raw_house_data["TotalSF"] = (
        raw_house_data["TotalBsmtSF"] + raw_house_data["1stFlrSF"] + raw_house_data["2ndFlrSF"]
    )

    return raw_house_data

In [4]:
# Calling `.local()` on an @op-annotated function allows us to execute the
# function locally for testing purposes. When a function is called with
# `.local()`, Aqueduct does not capture the function execution as a part of
# the definition of a workflow.
filled = fill_missing_data.local(train_data)

Now that we have our data cleaned, the next thing we need to do is encode our categorical columns. Some of our categorical columns are going to have ordinal properties, where the categories are meaningfully ordered, and so we use scikit-learn's `LabelEncoder` which encodes values in the order that they are seen. 

In [5]:
@aq.op
def encode_labels(cleaned_data):
    """
    In this function, we take a subset of our categorical cols (listed in the
    `categorical_cols` variable below), and we use scikit-learn's LabelEncoder to
    encode the values in those columns because the category values here are
    ordinally meaningful.
    """
    cleaned_data["MSSubClass"].astype(str, copy=False)
    cleaned_data["OverallCond"].astype(str, copy=False)
    cleaned_data["YrSold"].astype(str, copy=False)
    cleaned_data["MoSold"].astype(str, copy=False)

    from sklearn.preprocessing import LabelEncoder

    categorical_cols = [
        "FireplaceQu",
        "BsmtQual",
        "BsmtCond",
        "GarageQual",
        "GarageCond",
        "ExterQual",
        "ExterCond",
        "HeatingQC",
        "PoolQC",
        "KitchenQual",
        "BsmtFinType1",
        "BsmtFinType2",
        "Functional",
        "Fence",
        "BsmtExposure",
        "GarageFinish",
        "LandSlope",
        "LotShape",
        "PavedDrive",
        "Street",
        "Alley",
        "CentralAir",
        "MSSubClass",
        "OverallCond",
        "YrSold",
        "MoSold",
    ]

    for col in categorical_cols:
        enc = LabelEncoder()
        enc.fit(list(cleaned_data[col].values))
        cleaned_data[col] = enc.transform(list(cleaned_data[col].values))

    return cleaned_data

In [6]:
encoded = encode_labels.local(filled)

Next, we'll unskew our numerical features:

In [7]:
@aq.op
def unskew_features(all_features):
    """
    This function uses scipy's boxcox1p method to unskew any numerical columns
    that have a skewness of at least 0.75, as calculated by scipy.stats.skew.
    In our training dataset, this function unskews 59 of the numerical features.
    """
    from scipy.special import boxcox1p
    from scipy.stats import skew

    numeric_feats = all_features.dtypes[all_features.dtypes != "object"].index
    skewed_feats = (
        all_features[numeric_feats].apply(lambda x: skew(x.dropna())).sort_values(ascending=False)
    )

    skewness = pd.DataFrame({"Skew": skewed_feats})
    skewness = skewness[abs(skewness) > 0.75]

    skewed_features = skewness.index
    lam = 0.15
    for feat in skewed_features:
        all_features[feat] = boxcox1p(all_features[feat], lam)

    return all_features

In [8]:
unskewed = unskew_features.local(encoded)

Finally, we'll one-hot encode any remaining categorical variables:

In [9]:
@aq.op
def one_hot_encode(all_features):
    """
    This function takes in an almost-featurized housing dataset and uses
    pandas' `get_dummies` function to one-hot encode any categorical variables
    that haven't previously been touched by our featurization process.
    """
    return pd.get_dummies(all_features)

In [10]:
featurized = one_hot_encode.local(unskewed)

Now that we've finished featurizing our dataset, we're going to train a few models on our training data. We will train an ElasticNet model, a LASSO Regression model, a Gradient Boosting Regressor, and a Kernel Ridge Regessor model, and we'll put all of them in a list called `models`:

In [11]:
from sklearn.linear_model import ElasticNet, Lasso
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.kernel_ridge import KernelRidge
from sklearn.preprocessing import RobustScaler
from sklearn.pipeline import make_pipeline

# Train 4 different kinds of scikit-learn models on our training data.
# For the reasons why these 4 models were chosen, you can check out the
# original Kaggle competition linked earlier in this notebook.
ENet = make_pipeline(RobustScaler(), ElasticNet(alpha=0.0005, l1_ratio=0.9, random_state=3))
lasso = make_pipeline(RobustScaler(), Lasso(alpha=0.0005, random_state=1))
KRR = KernelRidge(alpha=0.6, kernel="polynomial", degree=2, coef0=2.5)
GBoost = GradientBoostingRegressor(
    n_estimators=3000,
    learning_rate=0.05,
    max_depth=4,
    max_features="sqrt",
    min_samples_leaf=15,
    min_samples_split=10,
    loss="huber",
    random_state=5,
)

models = [ENet, lasso, KRR, GBoost]

# Here, we use the SalePrice from the original training dataset as our y
# column, and we used the featurized DF we've created thus far as our X
# matrix.
y = train_data["SalePrice"]
X = featurized

for model in models:
    model.fit(X, y)

  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(


Finally, now that we've trained our models, we can write a prediction function. In this case, our prediction function is going to be quite simple: We're going to take all four of the models we trained above, we're going to compute their house price predictions, and we're going to compute the average of the 4 models' predictions:

In [12]:
@aq.op
def predict(raw_data, featurized_data):
    """
    This function takes in a fully featurized dataset, and it returns the average of the
    house price predictions made by the 4 models we've used. The average is unweighted, and it
    is computed by column-stacking the 4 prediction vectors, taking the average of the
    4 predictions.

    The results of this function are appended to the end of the original dataset.
    """
    predictions = []
    for model in models:
        predictions.append(model.predict(featurized_data))

    predictions = np.column_stack(predictions)
    raw_data["PredictedSalePrice"] = np.mean(predictions, axis=1)

    return raw_data

Now that we've defined all of our functions, we can create a full Aqueduct pipeline in a few function calls. First, we load the latest copy (rather than our previous snapshot) of the housing data from the Aqueduct demo database. In this example, the data is the same, but presumably, your database or data warehouse gets updated more often than our examples. 🙂

We then invoke the Aqueduct operators that we defined above, and Aqueduct will automatically construct a pipeline for us. At the end of this cell, we see a preview of our predicted sale prices:

In [13]:
demo_db = client.integration("aqueduct_demo")
raw_data = demo_db.sql("select * from house_prices;")

filled_data = fill_missing_data(raw_data)
encoded_data = encode_labels(filled_data)
unskewed_data = unskew_features(encoded_data)
featurized_data = one_hot_encode(unskewed_data)

predictions = predict(raw_data, featurized_data)

df = predictions.get()
df[["PredictedSalePrice"]]

Unnamed: 0,PredictedSalePrice
0,212066.834996
1,185468.400665
2,215006.389258
3,164454.782472
4,292068.915282
...,...
2914,84032.078128
2915,66575.662944
2916,170290.855793
2917,113928.671777


In [14]:
# This tells Aqueduct to save the results in predictions
# back to the demo DB we configured earlier, into a table called
# predicted_house_prices.
# NOTE: At this point, no data is actually saved! This is just
# part of a workflow spec that will be executed once the workflow
# is published below.
demo_db.save(predictions, table_name="predicted_house_prices", update_mode="replace")

And we're done! We can now call `publish_flow`, give our workflow a name, and tell Aqueduct which artifacts to publish as a part of it — in this case our `predictions`. Aqueduct will automatically detect everything that is required to publish those artifacts and create a workflow out of the code we've written in this notebook. If you navigate to the link provided in the response, you will see an interactive graph of your workflow that allows you to see the code that ran, the data it generated, and any logs or error messages. 

In [15]:
# This publishes all of the logic needed to create predicted_mpg
# and rmse to Aqueduct. The URL below will take you to the
# Aqueduct UI, which will show you the status of your workflow
# runs and allow you to inspect them.
client.publish_flow(name="House Price Predictor", artifacts=[predictions])

Url:  http://localhost:8080/workflow/d2d705f7-4bf8-4425-9455-b3555186c49b


<aqueduct.flow.Flow at 0x147cffcd0>