# King County Housing Regression
### Buy Low Sell High

Author : Wong Zhao Wu, Bryan

## Modelling Objective
Perform EDA and modelling to estimate King County housing prices, with Root Mean Squared Error (RMSE) as the primary metric to minimize.
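
For reference, the metric can be written out in a few lines. This is a minimal sketch with NumPy; the function name `rmse` and the sample prices are illustrative and do not come from the dataset.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up prices, for illustration only
actual = [300_000, 450_000, 1_200_000]
predicted = [310_000, 430_000, 1_000_000]
print(rmse(actual, predicted))
```

Because the errors are squared before averaging, the single large miss on the most expensive house dominates the result, which is worth keeping in mind when reading RMSE values on skewed price data.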

## Keywords
- Supervised Learning
- Regression
- Feature Engineering
- Hierarchical Clustering
- Gradient Boosted Trees

In [None]:
import pandas as pd
import numpy as np
import scipy.stats as ss
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm
import folium
from branca.element import Figure

# Reading Dataset
The [dataset](https://geodacenter.github.io/data-and-lab//KingCounty-HouseSales2015/) includes home sale prices and characteristics for Seattle and King County, WA (May 2014 - May 2015).

## Data Dictionary
|Columns| Description |
|:---|:---|
|id | Unique ID for each home sold|
|date | Date of the home sale|
|price | Price of each home sold|
|bedrooms | Number of bedrooms|
|bathrooms | Number of bathrooms, where .5 accounts for a room with a toilet but no shower|
|sqft_living | Square footage of the apartment's interior living space|
|sqft_lot | Square footage of the land space|
|floors | Number of floors|
|waterfront | A dummy variable for whether the apartment was overlooking the waterfront or not|
|view | An index from 0 to 4 of how good the view of the property was|
|condition | An index from 1 to 5 on the condition of the apartment|
|grade | An index from 1 to 13, where 1-3 falls short of building construction and design, 7 has an average level of construction and design, and 11-13 have a high quality level of construction and design.|
|sqft_above | The square footage of the interior housing space that is above ground level|
|sqft_basement | The square footage of the interior housing space that is below ground level|
|yr_built | The year the house was initially built|
|yr_renovated | The year of the house’s last renovation|
|zipcode | What zipcode area the house is in|
|lat | Latitude|
|long | Longitude|
|sqft_living15 | The square footage of interior housing living space for the nearest 15 neighbors|
|sqft_lot15 | The square footage of the land lots of the nearest 15 neighbors|

An additional [geoJSON](https://gis-kingcounty.opendata.arcgis.com/datasets/zipcodes-for-king-county-and-surrounding-area-shorelines-zipcode-shore-area) file is used for gaining insights and for data visualisation.

In [None]:
housing_df = pd.read_csv("../input/housesalesprediction/kc_house_data.csv")
housing_df.head()

# Exploratory Data Analysis

## Descriptive Summaries
By running `.info()` on our dataframe, we make the following initial observations about the dataset.

**Observations**

1. The shape of the dataset is `(21612, 21)`: 21612 observations and 21 columns (20 features + 1 target variable, `"price"`).
2. The datatypes of all columns are appropriate except for `"date"`, which can be converted to DateTime64 for better visualisation.
3. No missing values are reported at a glance.

In [None]:
housing_df.info()

### Statistical Summaries
We explore the variables through `.describe()` to get a rough sense of the mean, median and standard deviation of the data, and to spot potential outliers and skewness.

**Observations**

1. Extreme Outliers

    By comparing the max to the 75th percentile and the min to the 25th percentile, we notice a few extreme outliers that we might need to take care of.
    - sqft_living : The sudden jump from 2550 sqft at 75% to 13450 sqft at the maximum flags the presence of extreme outliers.
    - sqft_above : The sudden jump from 2210 sqft at 75% to 9410 sqft at the maximum flags the presence of extreme outliers.
    - sqft_lot15 : The sudden jump from 10083 sqft at 75% to 871200 sqft at the maximum flags the presence of extreme outliers.

2. Data Skewness

    By comparing the mean, median, 25%, 75%, min and max values, we can get a rough sense of which columns might suffer from skewness.
    - sqft_living
    - sqft_above
    - sqft_lot15

3. Missing Values

    For `["yr_renovated", "sqft_basement"]`, a minimum value of 0 is observed. A renovation year of 0 is not meaningful, and a basement area of 0 most likely marks a house without a basement, so we treat 0 as a null placeholder in both columns.

Further visualisation is needed to better understand the data and spot potential pain points that we might need to work on for model improvement.
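
The "max versus 75%" comparison can be made more systematic with the usual interquartile-range rule. The following is a rough sketch, not part of the original analysis; the helper name `iqr_outlier_share`, the multiplier `k=1.5` and the toy values are assumptions made for illustration.

```python
import pandas as pd

def iqr_outlier_share(series, k=1.5):
    """Fraction of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    outside = (series < q1 - k * iqr) | (series > q3 + k * iqr)
    return float(outside.mean())

# Toy column with one extreme value, for illustration only
toy = pd.Series([1400, 1800, 1900, 2100, 2500, 2600, 13450], name="sqft_living")
print(iqr_outlier_share(toy))
```

Applied column by column, a helper like this gives a single number per feature that can be compared, instead of reading the `.describe()` table by eye.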

In [None]:
housing_df.describe().drop(columns = 'id') #id is dropped temporarily as it does not provide useful information

In [None]:
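# Assign the result back instead of calling replace(inplace=True) on a column,
# which may silently leave housing_df unchanged in recent pandas versions
housing_df["yr_renovated"] = housing_df["yr_renovated"].replace(0, np.nan)
housing_df["sqft_basement"] = housing_df["sqft_basement"].replace(0, np.nan)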

## Graphical Summaries


### Pairplots
We plot out the pairplots to study the general distribution of numerical columns as well as its interaction and relationships with other columns.

**Observations**

1. Positively Skewed Features
    
    `["sqft_living15", "sqft_above", "sqft_living", "price"]`
    > Log Transformation might be required to unskew the data.

2. Positive Linear Relationship with target variable, `"price"`

    `["sqft_living", "sqft_above", "sqft_living15"]`
    > This indicates that the features mentioned might be useful in prediction of target variable

3. Extreme Outliers

    `["sqft_lot"]`

    From the histogram of sqft_lot and its scatter plots, we notice that most of its data points sit in the lower-end region, with multiple extreme outliers.
    > We might need to consider dropping the outliers.

In [None]:
sns.pairplot(housing_df[
    ['price', 'sqft_living', 'sqft_lot', 'sqft_above', 'sqft_basement','sqft_living15', 'sqft_lot15']
])
plt.show()

### Boxplots
Boxplots are plotted to better visualise the distribution of the numerical columns along with their outliers.

**Observations:**

1. Extreme outliers are observed in `["sqft_basement", "sqft_lot15", "sqft_lot"]`.

In [None]:
columns = ['price', 'sqft_living', 'sqft_lot', 'sqft_above', 'sqft_basement','sqft_living15', 'sqft_lot15']
number_of_columns = 4
number_of_rows = int(np.ceil(len(columns) / number_of_columns)) # Rows needed to fit every boxplot
sns.set_style('whitegrid')
plt.figure(figsize=(3*number_of_columns, 5*number_of_rows))
for i, col in enumerate(columns):
    plt.subplot(number_of_rows, number_of_columns, i+1) # subplot expects integer grid sizes
    sns.boxplot(y=col, data=housing_df, color='green')
plt.tight_layout()
plt.show()

### Housing Price with Time

Since a date column is included in the dataset, we can visualise the general trend of housing prices over time to look for interesting insights.

**Observations**
1. Housing prices seem to fluctuate from month to month without any visible trend.
2. The sudden spike around October 2014 is due to the presence of outliers rather than a general trend in the data.

In [None]:
housing_df['date'] = pd.to_datetime(housing_df['date'])
fig, ax = plt.subplots(figsize=(9,5))
sns.lineplot(x="date", y="price", data = housing_df, estimator = np.median)
plt.show()

In [None]:
prices = housing_df.groupby(pd.Grouper(key='date', freq='M'))['price']
price_month = {}
for name, price in prices:
    price_month[name] = price

fig, ax = plt.subplots(figsize=(9,5))
sns.boxplot(data = pd.DataFrame(price_month))
plt.xticks(rotation=90)
plt.xlabel("Date")
plt.ylabel("Price")
plt.show()
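
To check numerically whether a spike comes from a few outliers rather than a shift in typical prices, the monthly median can be compared with the monthly mean. This is a small sketch on made-up sales records; the frame `sales` and its values are illustrative only.

```python
import pandas as pd

# Toy sales records, for illustration only
sales = pd.DataFrame({
    "date": pd.to_datetime(["2014-05-02", "2014-05-20", "2014-06-03", "2014-06-15", "2014-06-28"]),
    "price": [400_000, 500_000, 450_000, 2_000_000, 470_000],
})

# Group by calendar month and compare robust and non-robust summaries
monthly = (
    sales.groupby(sales["date"].dt.to_period("M"))["price"]
    .agg(["median", "mean", "count"])
)
print(monthly)
```

A month whose mean sits far above its median, as June does here, is being pulled up by a handful of expensive sales.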

"yr_renovated" seems to form a positive linear relationship towards "price". As for "yr_built", the linear relationship is weaker and not as obvious.

In [None]:
fig, ax = plt.subplots(figsize=(9,7))
housing_df["yr_renovated"] = housing_df["yr_renovated"].replace(0, np.nan) # Make sure 0 is treated as a null value for yr_renovated
sns.lineplot(x="yr_built", y="price", data = housing_df)
sns.lineplot(x="yr_renovated", y="price", data = housing_df)
plt.show()

### Housing Price with Geo-Location

Since we are given the longitude, latitude and zipcode of each house, we can map the median housing prices by geo-location to look for spatial trends.

**Observations**
1. Housing near the waterfront seems to have higher median prices than housing further away from the waterfront.
2. The waterfront label does not cover all of the high-priced areas visible on the map.

In [None]:
seattlemap_fig=Figure(width=750,height=450)
zipcode_median = housing_df.groupby("zipcode")[["price"]].median().reset_index()

seattle_map = folium.Map(location=[47.6062, -122.3321 ], zoom_start=7)
seattlemap_fig.add_child(seattle_map)

folium.Choropleth(
    geo_data='dataset/Zip_Codes.geojson',
    data=zipcode_median,
    columns=['zipcode', 'price'],
    key_on='feature.properties.ZIP',
    fill_color='YlOrRd', 
    nan_fill_color = "#ffffff",
    fill_opacity=1, 
    line_opacity=1,
    smooth_factor=0
).add_to(seattle_map)

seattle_map

By visualising the waterfront label from the dataset, we observe that only certain areas are flagged as waterfront, and the label misses some of the hotspots seen in the price map above.

Hence, feature engineering may be needed to help the model understand the relationship between geographic location and price.

In [None]:
seattlemap_fig=Figure(width=750,height=450)
zipcode_waterfront = housing_df.groupby("zipcode")[["waterfront"]].mean().reset_index()

seattle_map = folium.Map(location=[47.6062, -122.3321 ], zoom_start=7)
seattlemap_fig.add_child(seattle_map)

folium.Choropleth(
    geo_data='dataset/Zip_Codes.geojson',
    data=zipcode_waterfront,
    columns=['zipcode', 'waterfront'],
    key_on='feature.properties.ZIP',
#     fill_color='YlOrRd', 
    nan_fill_color = "#ffffff",
    fill_opacity=1, 
    line_opacity=1,
    smooth_factor=0
).add_to(seattle_map)

seattle_map

### Housing Price with Categorical Features

The dataset contains several categorical columns. We can plot each of them against the median price to see their impact on the target variable, price.

Categorical Columns : ["bedrooms", "bathrooms", "floors", "waterfront", "view", "grade"]

**Observations**
1. All categorical features seem to possess some positive linear relationship with median price, except for "floors", where the median price is relatively constant across values.

In [None]:
cat_col = ["bedrooms", "bathrooms", "floors", "waterfront", "view", "grade", "condition"]
plt.figure(figsize=(15,10))
for i, col in enumerate(cat_col):
    sns.set_palette(sns.color_palette("Paired"))
    ax = plt.subplot(3,3,i+1)
    sns.barplot(x=col, y ='price', estimator = np.median, data = housing_df, ax = ax)
    sns.set_style('whitegrid')
    plt.xticks(rotation=90)
    plt.ylabel("Median Price")
    plt.tight_layout()
plt.show()

## Correlation Matrix
From the Pearson correlation matrix, we observe that `["sqft_lot", "condition", "yr_built", "sqft_lot15"]` have an absolute correlation $|r| < 0.1$ with the target variable, price, which indicates a weak linear relationship. This is consistent with the observations made from the pairplots.

Besides, we should also be aware of the high linear correlation between the pairs sqft_living-sqft_above and grade-sqft_living, which can make the coefficients of a linear model unstable.

In [None]:
plt.figure(figsize=(12,12))
sns.heatmap(
    housing_df.drop(columns = ['id','date','zipcode','long','lat']).corr().abs(),
    cmap="cubehelix_r",
    annot = True)
plt.show()

# Data Preprocessing

After EDA, we have identified the following issues to address in our dataset.
1. Columns that provide no information about the target variable
2. Missing values in yr_renovated and sqft_basement
3. Highly skewed columns with extreme outliers
4. Weak linear correlation between certain columns and the target (evaluate further after modelling)

## Dropping Columns

We will drop the following columns as they do not reveal any relationship with the target variable.
- Id
- Date
- Zipcode (Although geolocation seems quite useful for our prediction, zipcode is dropped at first as it does not directly provide information to the models)

In [None]:
housing_df_dropped = housing_df.drop(columns=['id', 'date', 'zipcode'])

## Handling Missing Values
As around 95% and 60.73% of the data points are missing from the `yr_renovated` and `sqft_basement` columns respectively, the safest approach is to drop both columns entirely. We can revisit this decision during model improvement by trying imputation methods such as filling with a measure of central tendency or performing iterative imputation.

In [None]:
print("Percentage of Missing Values in yr_renovated column:\n{:.2f}%".format(housing_df_dropped.yr_renovated.isnull().sum()*100 / housing_df_dropped.shape[0]))
print("Percentage of Missing Values in sqft_basement column:\n{:.2f}%".format(housing_df_dropped.sqft_basement.isnull().sum()*100 / housing_df_dropped.shape[0]))
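
The same percentages can be obtained for every column at once with `isna().mean()`. This is a compact sketch on a toy frame; the values are made up and only the column names mirror the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame where 0 is a placeholder, for illustration only
toy = pd.DataFrame({
    "yr_renovated": [0, 0, 1995, 0],
    "sqft_basement": [0, 400, 0, 800],
})

# Treat 0 as missing without mutating the original frame
with_nan = toy.replace(0, np.nan)

# isna().mean() gives the fraction of missing rows per column
missing_pct = with_nan.isna().mean().mul(100).round(2)
print(missing_pct)
```

Sorting `missing_pct` in descending order is a quick way to shortlist columns that are candidates for dropping or imputation.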

In [None]:
housing_df_dropped.drop(columns = "yr_renovated", inplace = True) # Drop yr_renovated column which has 95% of null values
housing_df_dropped.drop(columns = "sqft_basement", inplace = True) # Drop sqft_basement column which has 60% of null values
print("yr_renovated" in housing_df_dropped.columns)
print("sqft_basement" in housing_df_dropped.columns)

## Log Transformation
We apply a log transformation to reduce the skew of feature columns that are highly skewed with extreme outliers. Boxplots are subsequently plotted to evaluate the effectiveness of the approach.

In [None]:
highly_skewed_columns = ['sqft_living', 'sqft_lot', 'sqft_above','sqft_living15', 'sqft_lot15']
housing_df_dropped[highly_skewed_columns] = np.log(housing_df_dropped[highly_skewed_columns])
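
The effect of the transformation can also be quantified with the sample skewness. The sketch below uses synthetic log-normal values as a stand-in for a column such as sqft_lot; the distribution parameters are assumptions chosen only to produce a long right tail.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic right-skewed values standing in for a column such as sqft_lot
lot = pd.Series(rng.lognormal(mean=9.0, sigma=1.0, size=5_000))

skew_raw = lot.skew()          # strongly positive for a long right tail
skew_log = np.log(lot).skew()  # close to zero once the tail is compressed
print(round(skew_raw, 2), round(skew_log, 2))
```

Note that `np.log` requires strictly positive inputs; for columns that can contain zeros, `np.log1p` is a common alternative.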

In [None]:
number_of_columns = 4
number_of_rows = int(np.ceil(len(highly_skewed_columns) / number_of_columns)) # Rows needed to fit every boxplot
sns.set_style('whitegrid')
plt.figure(figsize=(3*number_of_columns, 5*number_of_rows))
for i, col in enumerate(highly_skewed_columns):
    plt.subplot(number_of_rows, number_of_columns, i+1) # subplot expects integer grid sizes
    sns.boxplot(y=col, data=housing_df_dropped, color='green')
plt.tight_layout()
plt.show()

After the log transformation, the columns appear closer to a normal distribution and less skewed. However, there are still a lot of outliers in the columns. For now, we leave them as they are and evaluate further during model improvement.

## Categorical Columns
Since all categorical columns are ordinal, not much preprocessing is needed: the data are ready to be passed into a machine learning model.

In [None]:
from sklearn.feature_selection import RFECV, SelectFromModel
from sklearn.preprocessing import StandardScaler, OneHotEncoder, PolynomialFeatures
from sklearn.model_selection import train_test_split, cross_validate, GridSearchCV, RandomizedSearchCV
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor, plot_tree
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, BaggingRegressor, AdaBoostRegressor, ExtraTreesRegressor
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error, mean_absolute_percentage_error

## Train-Test-Split
We split our dataset into train and test set to better evaluate the performance of the model by examining its performance against the train set and test set.

In [None]:
X = housing_df_dropped.drop(columns = 'price')
y = housing_df_dropped['price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .2, random_state = 12)

# Modelling

## Baseline Regressor
We use a dummy regressor as the baseline model, which always predicts the mean of the target variable. The baseline predictor serves as a reference point for model selection.

In [None]:
def model_evaluation(clf, X_train, X_test, y_train, y_test, scoring=[mean_squared_error, mean_absolute_error, r2_score],
                     columns = ["mse", "mae", "r2"], trainvstest = True ,graph = True, rmse = True, return_table = True, log=False):
        hist = []
        pred_train = clf.predict(X_train)
        pred_test = clf.predict(X_test)
        
        
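        if log: # if the target is log-scaled, map everything back to the original price scale
            y_train = np.exp(y_train) # Inverse log transform for training targets
            y_test = np.exp(y_test) # Inverse log transform for test targets
            pred_train = np.exp(pred_train) # Inverse log transform for training predictions
            pred_test = np.exp(pred_test) # Inverse log transform for test predictions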
        
        for score in scoring:
            if rmse and score == mean_squared_error:
                # Take the square root explicitly; the squared=False argument is not available in every scikit-learn version
                if trainvstest: # if compare both train and test set
                    hist.append(round(np.sqrt(score(y_train, pred_train)),2))
                hist.append(round(np.sqrt(score(y_test, pred_test)),2))
                continue
            if trainvstest:
                hist.append(round(score(y_train, pred_train),2))
            hist.append(round(score(y_test, pred_test),2))
        cols = columns.copy()
        if rmse:
            if "mse" not in cols:
                raise ValueError("Mean Squared Error is missing from columns")
            cols[cols.index("mse")] = 'rmse'
            
        if trainvstest:
            cols = np.array([["train_"+col, "test_"+col] for col in cols]).flatten()
        
        if graph:
            fig, ax = plt.subplots(figsize=(7,7))
            plt.scatter(y_test, pred_test, c='crimson')
            p1 = max(max(pred_test), max(y_test))
            p2 = min(min(pred_test), min(y_test))
            plt.plot([p1, p2], [p1, p2], 'b-')
            plt.xlabel('True Values', fontsize=15)
            plt.ylabel('Predictions', fontsize=15)
            ax.set_aspect('equal')
            plt.show()
            
        if return_table:
            return pd.DataFrame([hist], columns=cols, index = [type(clf).__name__])
        else:
            return type(clf).__name__, hist, cols

In [None]:
dummy = DummyRegressor()
dummy.fit(X_train, y_train)
model_evaluation(dummy, X_train, X_test, y_train, y_test)
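
A useful sanity check on the baseline: on the data it was fitted on, a mean predictor's RMSE equals the (population) standard deviation of the target. The sketch below demonstrates this on synthetic data; the generated `X` and `y` are placeholders, not the housing data.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(12)
# Synthetic target; the features are irrelevant to a mean predictor
X = rng.normal(size=(200, 3))
y = rng.lognormal(mean=13.0, sigma=0.5, size=200)

dummy = DummyRegressor(strategy="mean").fit(X, y)
train_rmse = np.sqrt(mean_squared_error(y, dummy.predict(X)))

# On the data it was fitted on, the mean predictor's RMSE is the population std of y
print(train_rmse, np.std(y))
```

Any candidate model therefore has to push its RMSE clearly below the spread of the prices themselves to be worth keeping.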

## Model Selection
After building our baseline model and evaluation function, we test different models from sklearn and decide which ones fit the dataset best.

In [None]:
def models_selection(clfs:list, X_train, X_test, y_train, y_test, 
                     scoring=[mean_squared_error, mean_absolute_error, r2_score],
                     columns = ["mse", "mae", "r2"], trainvstest = True,
                     graph = False, rmse = True, return_table = False):
    hists = []
    model_names = []
    for clf in tqdm(clfs):
        clf.fit(X_train, y_train)
        model_name, hist, col_name = model_evaluation(clf, X_train, X_test, y_train, y_test, 
                         scoring=scoring,columns = columns, trainvstest = trainvstest,
                         graph = graph, rmse = rmse, return_table = return_table)
        model_names.append(model_name)
        hists.append(hist)
    
    return pd.DataFrame(hists, columns = col_name, index = model_names)

In [None]:
models = [
    LinearRegression(), Lasso(), Ridge(), #linear_model
    KNeighborsRegressor(), #distance_based_model
    SVR(kernel = 'linear'), SVR(kernel = 'rbf'), SVR(kernel = 'poly', degree = 2), SVR(kernel = 'sigmoid'), #SVMs
    DecisionTreeRegressor(random_state = 12), #tree model
    RandomForestRegressor(random_state = 12), #ensemble tree models
    GradientBoostingRegressor(random_state = 12),
]
models_selection(models, X_train, X_test, y_train, y_test)

From the training results, we make the following observations for each family of models:
1. Linear Models (Linear Regression, Lasso [L1 norm], Ridge [L2 norm]):
    - Do not seem to suffer from major overfitting.
    - Might be underfitting due to their inductive bias.
    > Try polynomial regression and evaluate the model performance.
2. Distance-Based Models (KNeighborsRegressor):
    - Suffers from major overfitting.
    - Might also be slightly underfitting.
    > Increase the number of neighbours to reduce overfitting.
3. SVMs with (linear, rbf, poly, sigmoid) kernels:
    - Changing the kernel does not seem to improve the performance of the SVMs.
    - The SVMs suffer from high bias, as both the train and test set errors are relatively high.
    > Either increase the model complexity by increasing $C$ (a larger $C$ means weaker regularization) or invest time in other models.
4. Decision Tree:
    - Suffers from major overfitting.
    - Low bias, and able to obtain a relatively low test error.
    > Tune the model to reduce overfitting by limiting its complexity.
5. Ensemble Tree Models (RandomForest, GradientBoosting):
    - Overfit less than the single decision tree, although overfitting still occurs.
    - Low bias, and able to obtain a relatively low test error.
    > Tune the models to reduce overfitting by increasing regularization.

Conclusions:
1. Tree-based models can produce promising results once overfitting is reduced. Hence, further model tuning is required.
2. Linear models suffer from high bias. Hence, we can try to increase the model complexity to reduce underfitting.
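
The overfitting diagnosis above rests on the gap between train and test error. As a rough illustration on synthetic data (the data-generating formula and the choice of `max_depth=4` are assumptions, not tuned values), limiting the depth of a decision tree narrows that gap:

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(12)
X = rng.uniform(0, 10, size=(600, 2))
y = 3 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=2.0, size=600)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=12)

def rmse_gap(model):
    """Fit the model and return (train RMSE, test RMSE)."""
    model.fit(X_tr, y_tr)
    train = np.sqrt(mean_squared_error(y_tr, model.predict(X_tr)))
    test = np.sqrt(mean_squared_error(y_te, model.predict(X_te)))
    return train, test

# An unrestricted tree memorises the training set; a shallow tree cannot
full_train, full_test = rmse_gap(DecisionTreeRegressor(random_state=12))
shallow_train, shallow_test = rmse_gap(DecisionTreeRegressor(max_depth=4, random_state=12))
print(full_train, full_test)
print(shallow_train, shallow_test)
```

The unrestricted tree reaches a training error of essentially zero while its test error stays high, which is the same signature seen for the default DecisionTreeRegressor in the table above.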

# Model Improvement
After preliminary modelling, we have shortlisted two categories of models to work on: linear models and tree-based models. Before starting any feature engineering, feature elimination or hyperparameter tuning, we first train a decision tree regressor with default parameters to understand how it performs and which features it flags as important.

In [None]:
tree = DecisionTreeRegressor(random_state = 12)

tree.fit(X_train, y_train)
model_evaluation(tree, X_train, X_test, y_train, y_test)

In [None]:
pd.DataFrame(list(zip(X_train.columns, tree.feature_importances_))).sort_values(1, ascending = False)

From the feature importance, the following are the observations:
1. Features related to the housing details, like grade, sqft_living and sqft_living15, seem to provide useful information about the price of the house.
2. Geo-spatial information about the house, like lat, long and waterfront, is quite useful in predicting the price of the house.

Hence, further feature engineering on the geo-spatial data might help improve the performance of the model, as might further cleaning and preprocessing of the features related to housing information. It might also be helpful to revisit the decisions made earlier and evaluate whether they help us get better model performance.

## Train-Test-Dev Split
We re-split our data into Train(60%), Dev(20%), Test(20%) set for us to make decision based on the Dev set and finally evaluate the performance based on the test set.

In [None]:
X= housing_df.drop(columns = [
    'id', # No meaning
    'date', # Not useful
    'price', # Target Variable
    'yr_renovated', # High Null Values
    'sqft_basement' # High Null Values
])
y = housing_df['price']
highly_skewed_columns = ['sqft_living', 'sqft_lot', 'sqft_above','sqft_living15', 'sqft_lot15']
X[highly_skewed_columns] = np.log(X[highly_skewed_columns]) # Log transform columns with high number of outliers
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .2, random_state = 42)
X_train, X_dev, y_train, y_dev = train_test_split(X_train, y_train, test_size = .25, random_state = 42)
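
The two-step split yields 60/20/20 because the second call takes 25% of the remaining 80%. A quick check on placeholder arrays (the 1000-row size is arbitrary and chosen only to make the arithmetic easy to read):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(1000).reshape(-1, 1)
y = np.arange(1000)

# First hold out 20% for test, then 25% of the remaining 80% for dev (0.25 * 0.8 = 0.2)
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_dev, y_train, y_dev = train_test_split(X_rest, y_rest, test_size=0.25, random_state=42)

print(len(X_train), len(X_dev), len(X_test))  # 600 200 200
```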

## Feature Engineering
To improve the performance of the model, we can perform some feature engineering to bring out more information about the house that can help the model in its prediction.

### Geo-Spatial Data
A few features reveal pricing information through geo-spatial attributes: ["long", "lat", "zipcode", "waterfront"].

Although zipcode reveals the region the house is located in, its high cardinality makes pure one-hot encoding expensive and subjects the model to the Curse of Dimensionality.

Hence, it would be useful to group the zipcodes into a few clusters, separating areas with higher average house prices from areas with lower house prices. With that aim in mind, we can use unsupervised clustering to generate the clusters for us.

In [None]:
#To understand which features has higher absolute linear correlation towards price
housing_df.corr().abs()['price'].sort_values(ascending=False) 

In [None]:
# Extract the mean of features based on the zipcode
zipcode_df = X_train.groupby("zipcode").mean()[['grade', 'sqft_living', 'sqft_living15', 'bathrooms', 'sqft_above']]
zipcode_df

In [None]:
from scipy.cluster.hierarchy import dendrogram
from sklearn.cluster import AgglomerativeClustering


def plot_dendrogram(model, **kwargs):
    # Create linkage matrix and then plot the dendrogram

    # create the counts of samples under each node
    counts = np.zeros(model.children_.shape[0])
    n_samples = len(model.labels_)
    for i, merge in enumerate(model.children_):
        current_count = 0
        for child_idx in merge:
            if child_idx < n_samples:
                current_count += 1  # leaf node
            else:
                current_count += counts[child_idx - n_samples]
        counts[i] = current_count

    linkage_matrix = np.column_stack([model.children_, model.distances_,
                                      counts]).astype(float)

    # Plot the corresponding dendrogram
    dendrogram(linkage_matrix, **kwargs)

# setting distance_threshold=0 ensures we compute the full tree.
model = AgglomerativeClustering(distance_threshold=0, n_clusters=None)

model = model.fit(zipcode_df)
plt.title('Hierarchical Clustering Dendrogram')
# plot the top three levels of the dendrogram
plot_dendrogram(model, truncate_mode='level', p=3)
plt.hlines(2, 0, 300, colors = 'r')
plt.xlabel("Number of points in node (or index of point if no parenthesis).")
plt.show()

From the dendrogram, we decided to make the cut at y=2 so that only 4 main clusters of zipcodes remain.
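
Cutting a dendrogram at a given height is equivalent to stopping the merging once the linkage distance exceeds that height. As a sketch on synthetic points (the four blob centres and the threshold of 8.0 are illustrative assumptions and are unrelated to the zipcode data), `AgglomerativeClustering` can report how many clusters a given cut produces:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Four well-separated synthetic groups in 2-D, for illustration only
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [0, 10], [10, 0], [10, 10]])
points = np.vstack([c + rng.normal(scale=0.3, size=(15, 2)) for c in centers])

# Cutting the tree at a height keeps every merge below that height
cut = AgglomerativeClustering(distance_threshold=8.0, n_clusters=None).fit(points)
print(cut.n_clusters_)
```

Reading `n_clusters_` after fitting with a `distance_threshold` is a way to confirm that the height chosen visually on the dendrogram really corresponds to the intended number of clusters.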

In [None]:
cluster_model = AgglomerativeClustering(n_clusters=4) # Setting Number of Cluster to 4
zipcode_df['zipcode_cluster'] = cluster_model.fit_predict(zipcode_df)
X_train = pd.merge(X_train, zipcode_df[['zipcode_cluster']].reset_index(), on='zipcode', how = 'left')
X_dev = pd.merge(X_dev, zipcode_df[['zipcode_cluster']].reset_index(), on='zipcode', how = 'left')
X_test = pd.merge(X_test, zipcode_df[['zipcode_cluster']].reset_index(), on='zipcode', how = 'left')
X_train
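
Two details of this step are worth illustrating: `pd.merge` returns a frame with a fresh index, and a zipcode that never appears in the training set receives no cluster. The sketch below shows an alternative based on `Series.map`, on toy data; the frames `train` and `dev` and all their values are made up for illustration.

```python
import pandas as pd
from sklearn.cluster import AgglomerativeClustering

# Toy training rows and a dev row with a zipcode absent from training (illustrative values)
train = pd.DataFrame(
    {"zipcode": [98001, 98001, 98004, 98004, 98039, 98039],
     "grade": [6, 7, 10, 11, 12, 12],
     "sqft_living": [7.2, 7.3, 8.0, 8.1, 8.4, 8.5]},
    index=[10, 11, 12, 13, 14, 15],
)
dev = pd.DataFrame({"zipcode": [98004, 98199], "grade": [9, 8], "sqft_living": [7.9, 7.6]}, index=[20, 21])

# Cluster zipcodes on training-set averages only, so no dev/test information leaks in
zip_means = train.groupby("zipcode")[["grade", "sqft_living"]].mean()
zip_to_cluster = pd.Series(
    AgglomerativeClustering(n_clusters=2).fit_predict(zip_means),
    index=zip_means.index,
)

# Series.map keeps the original row index, unlike pd.merge, which resets it
train["zipcode_cluster"] = train["zipcode"].map(zip_to_cluster)
dev["zipcode_cluster"] = dev["zipcode"].map(zip_to_cluster)

# Zipcodes unseen in training come back as NaN and need an explicit fallback
print(dev["zipcode_cluster"].isna().tolist())
```

How to fill the unseen zipcodes (for example with the most common cluster) is a modelling choice that this sketch leaves open.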

In [None]:
# pd.merge reset the index of X_train, so attach the prices by position rather than by index
sns.barplot(x = 'zipcode_cluster', y = 'price', data = X_train.assign(price = y_train.values))
plt.title("Mean Price against Zipcode Cluster")
plt.show()

Although the relationship between zipcode_cluster and mean price might not be obvious at a glance, the clusters allow us to capture the information in zipcode without the cost of a high number of columns. We can evaluate this decision later. Note that zipcode_cluster is nominal categorical data that requires one-hot encoding.

# Pipeline
We use a pipeline to wrap several preprocessing modules that transform the data before passing it to the final model.

Pipeline Members:
1. ColumnTransformer : Routes categorical features (to be one-hot encoded) and numerical features (to be standardised) through separate pipelines.
2. OneHotEncoder : Converts categorical features into dummy variables (drop='first').
3. StandardScaler : Rescales every numerical feature to mean = 0 and standard deviation = 1.
4. PolynomialFeatures : Generates all polynomial combinations of the scaled numerical features.
5. SelectFromModel : Uses LASSO regression to make the features sparse and select the best features to pass to the final model. (This step is listed for completeness; the pipeline defined in the next cell does not include it.)
6. Final Predictor : Tree-based regressor.


In [None]:
cat_features = ['zipcode_cluster']
num_features = [col for col in X_train.columns if col not in cat_features]

num_features_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("poly", PolynomialFeatures()),
])

cat_features_pipe = Pipeline([
    ("onehot", OneHotEncoder(drop='first'))
])

prep_pipe = ColumnTransformer([
    ("num", num_features_pipe, num_features),
    ("cat", cat_features_pipe, cat_features)
])

pipe = Pipeline([
    ("preprocess", prep_pipe),
    ("estimator", GradientBoostingRegressor(random_state = 12))
])
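
The member list mentions a LASSO-based `SelectFromModel` step. As a hedged sketch of how such a step could be slotted between the polynomial expansion and the final regressor (synthetic data; `alpha=0.1`, the step names and `sketch_pipe` are assumptions, not the settings used in this notebook):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(12)
# Synthetic data: only the first two of six features drive the target
X = rng.normal(size=(300, 6))
y = 5 * X[:, 0] + 3 * X[:, 0] * X[:, 1] + rng.normal(scale=0.5, size=300)

sketch_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
    # Lasso shrinks weak coefficients to zero; SelectFromModel keeps the rest
    ("select", SelectFromModel(Lasso(alpha=0.1))),
    ("estimator", GradientBoostingRegressor(random_state=12)),
])
sketch_pipe.fit(X, y)

n_generated = sketch_pipe.named_steps["poly"].n_output_features_
n_kept = int(sketch_pipe.named_steps["select"].get_support().sum())
print(n_generated, n_kept)
```

Comparing `n_generated` with `n_kept` shows how many of the polynomial terms survive the selection; the strength of the pruning is controlled by `alpha`.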

In [None]:
pipe.fit(X_train,y_train)

In [None]:
model_evaluation(pipe, X_train, X_dev, y_train, y_dev)

## Model Selection With Pipeline
After wrapping our final model in a pipeline, we build a custom function that loops through a list of estimators and swaps each one in as the final step of the pipeline.

In [None]:
def models_selection_pipeline(pipe, clfs:list, X_train, X_test, y_train, y_test, 
                     scoring=[mean_squared_error, mean_absolute_error, r2_score],
                     columns = ["mse", "mae", "r2"], trainvstest = True,
                     graph = False, rmse = True, return_table = False, log = False):
    hists = []
    model_names = []
    for clf in tqdm(clfs):
        pipe.steps.pop(-1) #Remove Final Estimator
        pipe.steps.append(('estimator', clf)) #Append Estimator into pipeline as a (name, estimator) tuple
        
        pipe.fit(X_train, y_train)
        model_name, hist, col_name = model_evaluation(pipe, X_train, X_test, y_train, y_test, 
                         scoring=scoring,columns = columns, trainvstest = trainvstest,
                         graph = graph, rmse = rmse, return_table = return_table, log = log)
        model_names.append(type(clf).__name__)
        hists.append(hist)
    
    return pd.DataFrame(hists, columns = col_name, index = model_names)

In [None]:
tree_models = [
    DecisionTreeRegressor(random_state = 12), 
    RandomForestRegressor(random_state = 12), 
    GradientBoostingRegressor(random_state = 12),
    BaggingRegressor(random_state = 12),
    AdaBoostRegressor(random_state = 12),
    ExtraTreesRegressor(random_state = 12)
]
models_selection_pipeline(pipe, tree_models, X_train, X_dev, y_train, y_dev)

From the results of model selection with the pipeline, GradientBoostingRegressor looks like the best candidate, with low bias and low variance. We perform 5-fold cross-validation on the GradientBoostingRegressor pipeline and evaluate the mean and standard deviation of the root mean squared error.

In [None]:
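# The selection loop leaves its last model in the pipeline, so restore GradientBoostingRegressor first
pipe.set_params(estimator = GradientBoostingRegressor(random_state = 12))
cv_hist = cross_validate(pipe, X_train, y_train, scoring = 'neg_root_mean_squared_error', n_jobs = -1)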
pd.DataFrame(cv_hist).describe().rename(columns = {'test_score':'Root Mean Square'})[['Root Mean Square']].abs().T[['mean','std']]
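
The same summary can be computed directly from the fold scores. This is a minimal sketch on synthetic data using `cross_val_score`; the generated `X` and `y` are placeholders, and the bare regressor stands in for the full pipeline.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(12)
X = rng.normal(size=(400, 4))
y = 2 * X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=400)

# scikit-learn maximises scores, so RMSE is reported negated; flip the sign back
neg_rmse = cross_val_score(
    GradientBoostingRegressor(random_state=12), X, y,
    scoring="neg_root_mean_squared_error", cv=5,
)
fold_rmse = -neg_rmse

# ddof=1 matches the sample standard deviation reported by DataFrame.describe()
print(fold_rmse.mean(), fold_rmse.std(ddof=1))
```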

From the mean and standard deviation produced by cross-validation of the GradientBoostingRegressor pipeline, the results are within an acceptable range, as the error on the hold-out dev set does not deviate much from the cross-validation result.

# Hyperparameter Tuning
From the model selection results with the pipeline, we decided to explore hyperparameter tuning on the GradientBoostingRegressor pipeline, since it looks like a promising candidate with low bias and relatively low variance.

In [None]:
# pipe.steps.pop() # Remove estimator
# pipe.steps.append(('estimator', GradientBoostingRegressor())) # Replace estimator with GradientBoostingRegressor
# params = {
#     "estimator__n_estimators" : np.linspace(100,500,5, dtype = int),
#     "estimator__subsample" : [0.2, 0.5, 0.8, 1.0],
#     "estimator__min_samples_leaf" : [0.01],
#     "estimator__random_state" : [12],
#     "estimator__ccp_alpha" : [0, 0.1,0.2,0.3]
# }
# random_search_cv = RandomizedSearchCV(pipe, params,scoring = 'neg_root_mean_squared_error', n_iter = 20)
# random_search_cv.fit(X_train,y_train)

In [None]:
# random_search_cv.best_params_
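
Since the search cell above is commented out, here is a scaled-down sketch of the same idea that runs in a few seconds. The data are synthetic, and the parameter values, `n_iter=4` and `cv=3` are deliberately small assumptions rather than the grid used for the final model.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(12)
X = rng.normal(size=(200, 4))
y = 2 * X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=200)

search_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("estimator", GradientBoostingRegressor(random_state=12)),
])

# "<step name>__<parameter>" routes each setting to the right pipeline step
param_distributions = {
    "estimator__n_estimators": [50, 100],
    "estimator__subsample": [0.5, 0.8, 1.0],
    "estimator__min_samples_leaf": [0.01, 0.05],
}
search = RandomizedSearchCV(
    search_pipe, param_distributions, n_iter=4, cv=3,
    scoring="neg_root_mean_squared_error", random_state=12,
)
search.fit(X, y)
print(search.best_params_, -search.best_score_)
```

Passing `random_state` to `RandomizedSearchCV` makes the sampled candidates reproducible between runs, which helps when comparing successive versions of the search space.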

## RandomizedSearchCV Hyperparameter Tuning Evaluation

After running hyperparameter tuning for several hours, we use the best parameters found by RandomizedSearchCV to build our final pipeline, and perform the final evaluation with the hold-out dev set and test set along with 5-fold cross-validation.

In [None]:
randomsearch_pipe = Pipeline([
    ("preprocess", prep_pipe),
    ("estimator", GradientBoostingRegressor(
        subsample= 1,
        random_state = 12,
        n_estimators = 500,
        min_samples_leaf= 0.01,
        ccp_alpha= 0.3
        )
    )
])

In [None]:
randomsearch_pipe.fit(X_train, y_train)

In [None]:
model_evaluation(randomsearch_pipe, X_train, X_dev, y_train, y_dev)

In [None]:
cv_hist = cross_validate(randomsearch_pipe, X_train, y_train, scoring = 'neg_root_mean_squared_error', n_jobs = -1)
pd.DataFrame(cv_hist).describe().rename(columns = {'test_score':'Root Mean Square'})[['Root Mean Square']].abs().T[['mean','std']]

After evaluating the model with the hold-out dev set and 5-fold cross-validation, we notice that RandomizedSearchCV hyperparameter tuning reduced the dev set error and the overall cross-validation error by a small margin. This comes at the cost of a slight increase in the variance of our model, visible as a 2% increase in the gap between train and test error for the final tuned pipeline.
Due to time constraints (this is an assignment with a deadline), I did not try further to reduce the higher variance introduced by hyperparameter tuning.

Lastly, before wrapping up the model, we perform a final evaluation with our hold-out test set as the final check of our model's ability to generalize.

In [None]:
model_evaluation(randomsearch_pipe, X_train, X_test, y_train, y_test)

Although the test error is slightly higher than the average 5-fold error, it is still within 1 $\sigma$ of our mean, and hence the result is still acceptable.

In general, the plot above shows that the prediction error grows as the actual price increases. This indicates that the model does not do as well on higher-value houses as it does on average and lower-priced houses.
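
The pattern seen in the scatter plot can be quantified by binning the errors by price. The sketch below uses synthetic prices with noise proportional to the price, to mimic errors that grow with value; the distribution parameters and the 15% noise level are assumptions for illustration and are not results from the model.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(12)
# Synthetic prices with noise proportional to price, mimicking errors that grow with value
y_true = pd.Series(rng.lognormal(mean=13.0, sigma=0.5, size=2_000))
y_pred = y_true * (1 + rng.normal(scale=0.15, size=2_000))

residuals = pd.DataFrame({"price": y_true, "abs_error": (y_true - y_pred).abs()})
residuals["price_bin"] = pd.qcut(residuals["price"], q=4, labels=["Q1", "Q2", "Q3", "Q4"])

# Mean absolute error within each price quartile
mae_by_bin = residuals.groupby("price_bin", observed=True)["abs_error"].mean()
print(mae_by_bin)
```

Running the same grouping on the real `y_test` and the pipeline's predictions would show how much of the overall RMSE comes from the most expensive quartile.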

# Conclusion
By using GradientBoostingRegressor along with several preprocessing techniques and geo-location feature engineering, we have managed to predict housing prices from their attributes with a root mean squared error of around 129k on average.

## Saving Model
After completing model training and tuning, we can save our model into external file and reload it later during deployment.

In [None]:
# import pickle
# pickle.dump(randomsearch_pipe, open( "model/king_county_pipeline.p", "wb" )) #Dumping model into pickle file

In [None]:
# loaded_model = pickle.load( open( "model/king_county_pipeline.p", "rb" ) ) #Loading model
# loaded_model.predict(X_test) #Perform prediction with loaded model
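
As the cells above are commented out, the following sketch shows the same round trip in memory on a small stand-in pipeline. The model, the data and the name `small_pipe` are placeholders, not the tuned housing pipeline.

```python
import pickle

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(12)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5])

small_pipe = Pipeline([("scaler", StandardScaler()), ("estimator", LinearRegression())]).fit(X, y)

# Serialise to bytes and restore; writing to a file works the same way with pickle.dump/load
restored = pickle.loads(pickle.dumps(small_pipe))
print(np.allclose(small_pipe.predict(X), restored.predict(X)))  # True
```

Pickled models should be reloaded with the same scikit-learn version that produced them, and pickle files should only be loaded from trusted sources.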

## Personal Learning Reflection

Through the Seattle housing price prediction problem, I've learned more about **Geo-location Feature Engineering** without leaking the actual housing prices, and gained a better understanding of the **Bias-Variance trade-off** through countless iterations of redefining the parameter grid, hyperparameter tuning and model evaluation. Initially, a poorly designed parameter search grid produced a model that was more overfitted than the one with default parameters, despite the drop in test error. By doing more research and reading up on Gradient Boosting, I identified several key hyperparameters that can improve the performance without increasing the variance as much. I also decided to use **AWS Sagemaker** to host and run the entire experiment, to speed up the experiment iterations for this project.

Written By : Wong Zhao Wu

Last Modified : 25 May 2021

![seattle.jpg](https://images.unsplash.com/photo-1502175353174-a7a70e73b362?ixid=MnwxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8&ixlib=rb-1.2.1&auto=format&fit=crop&w=970&q=80)

Image retrieved from [Unsplash](https://unsplash.com/photos/skUTVJi8-jc).


Thank you for reading!