# **KING COUNTY HOUSE PRICE**
**[by Fabrizio Basso](https://www.linkedin.com/in/fabrizio-basso-4543463b/)**

## Dataset
<hr/>

* This dataset contains house sale prices for King County, which includes Seattle. 
* It includes homes sold between May 2014 and May 2015.
* 21 columns (features).
* 21613 rows.

***Feature Columns***
    
* **id:** Unique ID for each home sold
* **date:** Date of the home sale
* **price:** Price of each home sold --> This is the target feature
* **bedrooms:** Number of bedrooms
* **bathrooms:** Number of bathrooms, where .5 accounts for a room with a toilet but no shower
* **sqft_living:** Square footage of the apartment's interior living space
* **sqft_lot:** Square footage of the land space
* **floors:** Number of floors
* **waterfront:** A dummy variable for whether the apartment was overlooking the waterfront or not
* **view:** An index from 0 to 4 of how good the view of the property was
* **condition:** An index from 1 to 5 on the condition of the apartment
* **grade:** An index from 1 to 13, where 1-3 fall short of building construction and design, 7 has an average level of construction and design, and 11-13 have a high quality level of construction and design.
* **sqft_above:** The square footage of the interior housing space that is above ground level
* **sqft_basement:** The square footage of the interior housing space that is below ground level
* **yr_built:** The year the house was initially built
* **yr_renovated:** The year of the house’s last renovation
* **zipcode:** What zipcode area the house is in
* **lat:** Latitude
* **long:** Longitude
* **sqft_living15:** The square footage of interior housing living space for the nearest 15 neighbors
* **sqft_lot15:** The square footage of the land lots of the nearest 15 neighbors

### Import of Main Libraries

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
# Regular Imports
import os
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import matplotlib.patches as patches
from tabulate import tabulate
import missingno as msno 
import warnings
from joblib import dump, load
warnings.filterwarnings("ignore")

!pip install -U scikit-learn==0.24.1

import sklearn
sklearn.__version__
from sklearn.preprocessing import OneHotEncoder

# Set Color Palettes for the notebook
custom_colors = ['#74a09e','#86c1b2','#98e2c6','#f3c969','#f2a553', '#d96548', '#c14953']
sns.palplot(sns.color_palette(custom_colors))

# Set Style
sns.set_style("whitegrid",{"grid.linestyle":"--"})
sns.despine(left=True, bottom=True)
mpl.rcParams['figure.dpi'] = 250
mpl.rc('axes', labelsize=10)
plt.rc('xtick',labelsize=10)
plt.rc('ytick',labelsize=10)

%matplotlib inline
%config InlineBackend.figure_format = 'svg'

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
        
        

## **2.0 Exploratory Data Analysis**

In this section the main goal is to get familiar with the dataset. Among all the topics covered, it will address the following questions.

### Which features contain blank, null or empty values?

We can check for missing values with pandas isna(), which flags each value as missing or not. Summing these flags column by column gives the number of missing values in every feature.

### Which features are categorical?

These values classify the samples into sets of similar samples. Within categorical features are the values nominal, ordinal, ratio, or interval based? Among other things this helps us select the appropriate plots for visualization.

* **Categorical**: id, waterfront, zipcode.

### Which features are numerical? 
These values change from sample to sample. Within numerical features are the values discrete, continuous, or timeseries based? Among other things this helps us select the appropriate plots for visualization.

* **Continuous**: price, bathrooms, floors, lat, long.
* **Discrete**: date, bedrooms, sqft_living, sqft_lot, view, condition, grade, sqft_above, sqft_basement, yr_built, yr_renovated, sqft_living15, sqft_lot15.

In [None]:
#import the dataset
house_df = pd.read_csv('../input/housesalesprediction/kc_house_data.csv',parse_dates=['date'])

### 2.1 **Assess the presence of missing values** 

In [None]:
house_df.isna().sum()

In [None]:
msno.matrix(house_df, figsize=(12.5,5), fontsize=10, color=(0.8, 0.25, 0.25));

Since there are no missing values in the dataset, there is no need for imputation.

Now the number of unique values for each feature is assessed:

In [None]:
for column in house_df.columns:
    print(f'Unique values for {column}: {house_df[column].nunique()}')

The "id" feature is only an identifier, with an almost unique value for each transaction. It carries no information about the price, so it can be dropped from the dataset.

In [None]:
house_df.drop('id', axis=1, inplace=True)

Moreover the data appear to be in the right format:

In [None]:
house_df.info()

### **2.2 Pearson correlation matrix**

The Pearson correlation coefficient evaluates the strength and direction of the linear relationship between two variables. The coefficient ranges between -1 and +1. The greater its absolute value, the stronger the relationship between the two features.
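
As a minimal sketch of what the coefficient measures, the cell below computes it by hand on a small toy table and compares the result with pandas. The column names and values are hypothetical and are not taken from the King County file.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, for illustration only)
toy = pd.DataFrame({
    "sqft": [800, 1200, 1500, 2000, 2600],
    "price": [210_000, 300_000, 340_000, 480_000, 610_000],
})

x = toy["sqft"].to_numpy(dtype=float)
y = toy["price"].to_numpy(dtype=float)

# Pearson r = sum of centred cross-products / sqrt(product of centred sums of squares)
x_c = x - x.mean()
y_c = y - y.mean()
r_manual = (x_c * y_c).sum() / np.sqrt((x_c ** 2).sum() * (y_c ** 2).sum())

# pandas uses the Pearson definition by default (method="pearson")
r_pandas = toy["sqft"].corr(toy["price"])

print(round(r_manual, 4), round(r_pandas, 4))
```

Both numbers should agree, since the heatmap built with `house_df.corr()` relies on the same definition.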

In [None]:
sns.set(style="whitegrid", font_scale=1)

plt.figure(figsize=(13,13))
plt.title('Pearson Correlation Matrix',fontsize=25)
sns.heatmap(house_df.corr(),linewidths=0.45,vmax=0.7,square=True,cmap="autumn_r",linecolor='w',
            annot=True, annot_kws={"size":7}, cbar_kws={"shrink": .8});

sns.set_style("whitegrid",{"grid.linestyle":"--"})

As the next step in the EDA process, some of the features are analyzed in detail to assess their nature and the information they carry about the target feature, the house price. The first feature is the zipcode.

## - **ZipCode**

Taken at face value, the *zipcode* does not appear to capture much information about house prices: its correlation with the price is -0.05. However, this is highly misleading. All else being equal, zipcodes covering posh, well-off areas identify properties with higher prices. In total there are 70 zipcodes in the dataset: 

In [None]:
# Number of Zipcodes:
len(house_df['zipcode'].unique())

The number of property transactions is very unevenly distributed across the postal codes. It ranges from a maximum of about 600 to a minimum of about 50. Moreover, properties facing the waterfront are only located in specific zipcodes covering areas close to the water. 

In [None]:
fig, ax = plt.subplots(figsize=(13,6))

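g = sns.countplot(x='zipcode', hue='waterfront', data=house_df, ax=ax)
# Rotate the tick labels seaborn already set: it sorts numeric categories,
# so relabelling with unique() (order of appearance) would mislabel the bars
g.tick_params(axis='x', labelrotation=90, labelsize=10)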
g.grid(linestyle='--')

In [None]:
fig,ax = plt.subplots(figsize=(15,6))
sns.boxplot(x='zipcode',y='price',data=house_df,ax=ax, palette='Reds');
# Keep seaborn's own (sorted) tick labels and only rotate them
ax.tick_params(axis='x', labelrotation=90, labelsize=10)
ax.set_title('Boxplot: Price Distribution by Zipcodes');

**Conclusion**: The graph above shows that some zipcode areas have significantly different price distributions than others. Therefore, zipcode is a relevant feature. However, since it is a categorical feature, there is no direct correspondence between its numeric value and a specific variation in the price. Its value is only a convention: a number is used, but it could just as well be a sequence of letters. For instance, moving from zipcode 98001 to 98100 does not produce a specific change in the price just because the value has increased by about 100. To address this issue, **a separate dummy variable needs to be created for each zipcode. This operation will be performed later on, as a data pre-processing step within a pipeline.**
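
The point about the zipcode can be illustrated with a small sketch. The toy table below is hypothetical (prices in thousand USD, made up for illustration): the Pearson correlation between the zipcode number and the price is weak, yet the average price differs a lot from one zipcode to the next, which is exactly the information that dummy variables expose.

```python
import pandas as pd

# Hypothetical toy data: four zipcodes with clearly different price levels
toy = pd.DataFrame({
    "zipcode": [98001, 98001, 98004, 98004, 98039, 98039, 98100, 98100],
    "price":   [300, 320, 900, 950, 2000, 2100, 350, 330],
})

# Correlation between the zipcode *number* and the price: weak
r = toy["zipcode"].corr(toy["price"])

# Average price per zipcode: very different from group to group
group_means = toy.groupby("zipcode")["price"].mean()

# One dummy column per zipcode, so each area gets its own effect
dummies = pd.get_dummies(toy["zipcode"], prefix="zip")

print(f"Pearson r: {r:.2f}")
print(group_means)
print(dummies.columns.tolist())
```

`pd.get_dummies` is used here only for brevity; the notebook itself relies on scikit-learn's OneHotEncoder inside a pipeline.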

### - **Feature Analysis: Bedrooms, Floors and Bathrooms:**

The graphs below confirm the general intuition derived from the correlation matrix. These features have, on average, a positive correlation with property prices.

In [None]:
f, axes = plt.subplots(1, 2,figsize=(15,5))
sns.boxplot(x=house_df['bedrooms'],y=house_df['price'], ax=axes[0], palette = 'autumn_r')
sns.boxplot(x=house_df['floors'],y=house_df['price'], ax=axes[1], palette = 'autumn_r')
sns.despine(left=True, bottom=True)
axes[0].set(xlabel='Bedrooms', ylabel='Price')
axes[0].yaxis.tick_left()
axes[1].yaxis.set_label_position("right")
axes[1].yaxis.tick_right()
axes[1].set(xlabel='Floors', ylabel='Price')

f, axe = plt.subplots(1, 1,figsize=(15,5))
sns.despine(left=True, bottom=True)
sns.boxplot(x=house_df['bathrooms'],y=house_df['price'], ax=axe, palette = 'autumn_r')
axe.yaxis.tick_left()
axe.set(xlabel='Bathrooms', ylabel='Price');

### - **Feature Analysis: WaterFront, View and Grade:**
A similar consideration applies to these three features. In particular, a "waterfront" location seems to give quite a boost to the property price. The same applies to the property's view and building quality (grade). Note that in this case it is not necessary to create dummies for "view" and "grade", since higher values of these features correspond to better qualities of the property. 

In [None]:
f, axes = plt.subplots(1, 2,figsize=(15,5))
sns.boxplot(x=house_df['waterfront'],y=house_df['price'], ax=axes[0], palette = 'viridis')
sns.boxplot(x=house_df['view'],y=house_df['price'], ax=axes[1], palette = 'viridis')
sns.despine(left=True, bottom=True)
axes[0].set(xlabel='Waterfront', ylabel='Price')
axes[0].yaxis.tick_left()
axes[1].yaxis.set_label_position("right")
axes[1].yaxis.tick_right()
axes[1].set(xlabel='View', ylabel='Price')


f, axe = plt.subplots(1, 1,figsize=(15,5))
sns.boxplot(x=house_df['grade'],y=house_df['price'], ax=axe, palette = 'viridis')
sns.despine(left=True, bottom=True)
axe.yaxis.tick_left()
axe.set(xlabel='Grade', ylabel='Price');

### **Construction Year and Renovations: Binning**

Data binning is a preprocessing technique used to reduce the effects of minor observation errors. It is worthwhile applying this transformation to some columns of this dataset. Binning is applied to the age of the building and the age of its last renovation, both derived from yr_built and yr_renovated and calculated relative to the year in which the property is sold. The original features are dropped. The distributions of the binned features are shown in the following graphs:

In [None]:
# just take the year from the date column
house_df['sales_yr']=pd.DatetimeIndex(house_df['date']).year
house_df['sales_mth']=pd.DatetimeIndex(house_df['date']).month

# add the age of the buildings when the houses were sold as a new column
house_df['age']=house_df['sales_yr']-house_df['yr_built']
# add the age of the renovation when the houses were sold as a new column;
# houses never renovated (yr_renovated == 0) fall back to the building age
renovated = house_df['yr_renovated'] != 0
house_df['age_rnv'] = np.where(renovated, house_df['sales_yr'] - house_df['yr_renovated'], house_df['age'])

In [None]:
# partition the age into bins
bins_age = [-2,1,5,10,20,30,60,100,100000]
labels = [0,5,10,20,30,60,80,100]
house_df['age_binned'] = pd.cut(house_df['age'], bins=bins_age, labels=labels)

In [None]:
# partition the age_rnv into bins
bins_ren = [-2,1,5,10,20,30,50,100000]
labels = [0,5,10,20,30,60,100]
house_df['age_rnv_binned'] = pd.cut(house_df['age_rnv'], bins=bins_ren, labels=labels)
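
To make the binning rule concrete, here is a small sketch that applies the same edges and labels as `bins_age` to a handful of hypothetical ages. By default `pd.cut` builds right-closed intervals, so an age of 5 falls in (1, 5] and an age of 6 in (5, 10]. The lower edge of -2 presumably accommodates homes sold the year before their recorded construction year (age -1).

```python
import pandas as pd

# Hypothetical ages, chosen to sit on and around the bin edges
ages = pd.Series([-1, 0, 1, 2, 5, 6, 20, 21, 60, 61, 115])

bins_demo = [-2, 1, 5, 10, 20, 30, 60, 100, 100000]  # same edges as bins_age
labels_demo = [0, 5, 10, 20, 30, 60, 80, 100]        # one label per interval

binned = pd.cut(ages, bins=bins_demo, labels=labels_demo)

# Show each age next to the label of the interval it falls into
print(pd.DataFrame({"age": ages, "bin_label": binned}))
```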

In [None]:
fig, ax = plt.subplots(1,2, figsize=(15,5))

sns.countplot(x=house_df['age_binned'], palette='Reds', ax=ax[0], alpha=0.85);
sns.countplot(x=house_df['age_rnv_binned'], palette='Blues', ax=ax[1], alpha=0.85)
ax[0].set_title('Years since Construction')
ax[1].set_title('Years since Renovation');

### **Year and Month of Transaction - Information Extraction**

The transactions range from May 2014 to May 2015. At the monthly level, prices fluctuate by roughly 10% between the minimum and the maximum.
The date is split into two features, the year and the month of the transaction, while the original feature is dropped.


In [None]:
house_df.groupby(["sales_yr","sales_mth"])["price"].agg(['mean','median']).plot(figsize=(15,6), marker='*', markersize = 12)
plt.title('Price Evolution over Time', fontsize=17);

In [None]:
house_df.drop(['date'], inplace=True, axis=1)
house_df.drop(['yr_built','yr_renovated'], inplace=True, axis=1)

house_df_bin = house_df[['price', 'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors',
                         'waterfront', 'view', 'condition', 'grade', 'sqft_above','sqft_basement',
                         'zipcode', 'lat', 'long', 'sqft_living15','sqft_lot15', 'sales_yr',
                         'sales_mth','age_binned','age_rnv_binned']].copy()  # explicit copy: new columns are added below

### **Features to Normalize**

Some features go through a log transformation to make their distributions more Normal-like. The original features are then dropped.

a) **Price**:

In [None]:
plt.figure(figsize=(10,6))
sns.distplot(house_df_bin["price"], color='r');

In [None]:
house_df_bin["log_price"] = np.log(house_df_bin["price"])

plt.figure(figsize=(10,6))
sns.distplot(house_df_bin["log_price"], color='r');

The original price feature is then dropped:

In [None]:
house_df_bin.drop(['price'], inplace=True, axis=1)

### **Additional Features to Normalize**:

The same procedure is applied to these features:

- sqft_living
- sqft_lot
- sqft_basement 
- sqft_living15 
- sqft_lot15

Since the value "0" cannot go through a log transformation, +1 is added to those features that can take that value:

In [None]:
cols = ["sqft_living","sqft_lot","sqft_basement","sqft_living15","sqft_lot15"]
house_df_bin[cols].describe()

Basement square footage is the only feature in this subset that takes the value 0.

In [None]:
house_df_bin.loc[:,'sqft_basement_log'] = np.log(house_df_bin.loc[:,'sqft_basement']+1)
house_df_bin.loc[:,'sqft_living_log'] = np.log(house_df_bin.loc[:,'sqft_living'])
house_df_bin.loc[:,'sqft_lot_log'] = np.log(house_df_bin.loc[:,'sqft_lot'])
house_df_bin.loc[:,'sqft_living15_log'] = np.log(house_df_bin.loc[:,'sqft_living15'])
house_df_bin.loc[:,'sqft_lot15_log'] = np.log(house_df_bin.loc[:,'sqft_lot15'])
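
The effect of the +1 shift can be checked on a few hypothetical basement sizes. The sketch below shows that a plain log breaks down at 0, that `np.log(x + 1)` is the same as NumPy's `np.log1p(x)`, and that `np.expm1` undoes the transformation.

```python
import numpy as np

# Hypothetical basement sizes in square feet, including a home without basement
sqft_demo = np.array([0.0, 120.0, 500.0, 1500.0])

with np.errstate(divide="ignore"):
    plain_log = np.log(sqft_demo)       # log(0) evaluates to -inf

shifted_log = np.log(sqft_demo + 1)     # the transformation used in this notebook
log1p_values = np.log1p(sqft_demo)      # equivalent NumPy helper

# expm1 inverts log1p and recovers the original square footage
recovered = np.expm1(log1p_values)

print(plain_log[0], shifted_log[0])
print(recovered)
```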

The original and the new distributions of these features are shown in the graphs below:

In [None]:
log_cols = ["sqft_living_log","sqft_lot_log","sqft_basement_log","sqft_living15_log","sqft_lot15_log"]

fig, axes = plt.subplots(2,5,figsize=(13,5))
axes = np.ravel(axes)
for num, ax in enumerate(axes):
  if num<5:
    sns.distplot(house_df_bin[cols[num]],ax=ax, color=custom_colors[num])
  else:
    sns.distplot(house_df_bin[log_cols[num-5]],ax=ax, color=custom_colors[num-5])
    
plt.tight_layout()

As before, original values are then dropped:

In [None]:
house_df_bin.drop(cols, axis=1, inplace=True)

## **<font color='green'>3 Data Preprocessing/Feature Engineering</font>:**

### 3.1 Pre-Processing Data
Apply pre-processing steps to your training and testing datasets separately in order to avoid data leakage.

**Data normalization**

If the dataset has numerical features with different scales, standardize the data to create a dataset within the same scale.
StandardScaler and MinMaxScaler are very popular data normalization methods.
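
A minimal sketch of the leakage-free pattern, on hypothetical numbers: the scaler learns its minimum and maximum from the training data only and then reuses them on unseen data. The name `scaler_demo` is illustrative and is not part of the notebook.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical living-space values (square feet)
train_demo = np.array([[1000.0], [2000.0], [3000.0]])
test_demo = np.array([[1500.0], [4000.0]])

scaler_demo = MinMaxScaler()
train_scaled = scaler_demo.fit_transform(train_demo)  # learns min=1000, max=3000
test_scaled = scaler_demo.transform(test_demo)        # reuses the training min/max

print(train_scaled.ravel())
print(test_scaled.ravel())
```

Values outside the [0, 1] range on unseen data are expected: they simply mean that the observation lies outside the range seen during training.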

**OneHotEncoder()**

If the dataset that will be used for the regression model includes categorical and/or boolean-type columns, use OneHotEncoder to transform them into numeric arrays.
Below, the numerical and categorical columns are determined, and the scaler and the OneHotEncoder are instantiated:

In [None]:
from sklearn.preprocessing import OneHotEncoder, StandardScaler, MinMaxScaler

house_df_bin['age_binned'] = house_df_bin['age_binned'].astype('int64') 
house_df_bin['age_rnv_binned'] = house_df_bin['age_rnv_binned'].astype('int64') 

numerical_columns = house_df_bin.drop(['log_price','zipcode'], axis=1).columns
scaler = MinMaxScaler()

categorical_columns = ['zipcode']
ohe = OneHotEncoder(handle_unknown='error', drop='first', sparse=False)
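
The sketch below illustrates, on a hypothetical four-row column, what `drop='first'` and the default `handle_unknown='error'` imply: the first (smallest) zipcode becomes the reference category with no column of its own, and a zipcode never seen during fitting raises an error at transform time. The name `ohe_demo` is illustrative.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

toy = pd.DataFrame({"zipcode": [98001, 98004, 98039, 98004]})

ohe_demo = OneHotEncoder(drop="first")           # sparse output by default
encoded = ohe_demo.fit_transform(toy).toarray()  # densify for display

print(ohe_demo.categories_[0])  # the three zipcodes found while fitting
print(encoded)                  # two columns: 98001 is the dropped reference

# An unseen zipcode is rejected, because handle_unknown defaults to 'error'
try:
    ohe_demo.transform(pd.DataFrame({"zipcode": [98199]}))
    unseen_ok = True
except ValueError:
    unseen_ok = False
print(f"Unseen zipcode accepted: {unseen_ok}")
```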

**Features** and **Target** variables are then separated and defined:

In [None]:
X_bin = house_df_bin.drop(['log_price'], axis=1)
y = house_df_bin['log_price']

Features dataset has 20 different features:

In [None]:
print(f'Total number of Features: {len(X_bin.columns)}')
X_bin.columns

### 3.2 Train-Validation-Test dataset:

The dataset is now divided into training, validation and test sets. The random_state of the split is set to 170378 (1703 is used for the models)... I need to pay my respects to St. Patrick!

In [None]:
from sklearn.model_selection import train_test_split, GridSearchCV

X_train, X_test, y_train, y_test = train_test_split(X_bin, y,  test_size=.15, random_state=170378)
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train,  test_size=.18, random_state=170378)

print(f"Train Data Shape: {X_train.shape}")
print(f"Valid Data Shape: {X_valid.shape}")
print(f"Test Data Shape: {X_test.shape}")
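
The two consecutive splits can be reproduced on a plain row index to see the resulting proportions: 15% of the rows go to the test set, and 18% of the remaining 85% (about 15% of the total) go to the validation set, leaving roughly 70% for training. This is a sketch with an arbitrary `random_state=0`, so the selected rows differ from those in the notebook; only the sizes matter here.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rows = np.arange(21613)  # one entry per row of the dataset

# Same test_size values as in the notebook, arbitrary seed
rest_idx, test_idx = train_test_split(rows, test_size=0.15, random_state=0)
train_idx, valid_idx = train_test_split(rest_idx, test_size=0.18, random_state=0)

for name, part in [("train", train_idx), ("valid", valid_idx), ("test", test_idx)]:
    print(f"{name}: {len(part)} rows ({len(part) / len(rows):.1%})")
```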

Since the models will later also be evaluated on the price in levels, the target feature y is also stored in levels.

In [None]:
y_train_lev = np.exp(y_train)
y_valid_lev = np.exp(y_valid)
y_test_lev = np.exp(y_test)

The training set is used to fit the models, while the validation and test sets are used to assess how well the models generalize. In particular, the test set is reserved for the final (submission) forecast.

**ColumnTransformer()**

This applies transformers to columns of an array or a pandas DataFrame. It will be the first step of the pipeline used to fit the models.

In [None]:
from sklearn.compose import ColumnTransformer

preprocessor = ColumnTransformer(transformers =[('num', scaler, numerical_columns),('cat', ohe, categorical_columns)],remainder='drop')
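
As an illustration of how the columns are routed, the sketch below builds a similar transformer on a hypothetical three-column table. The numeric columns are scaled, the zipcode is one-hot encoded, and the outputs are concatenated side by side. The names `toy` and `pre_demo` are illustrative.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

toy = pd.DataFrame({
    "bedrooms": [2, 3, 4, 3],
    "grade": [6, 7, 9, 8],
    "zipcode": [98001, 98004, 98039, 98004],
})

pre_demo = ColumnTransformer(
    transformers=[
        ("num", MinMaxScaler(), ["bedrooms", "grade"]),
        ("cat", OneHotEncoder(drop="first"), ["zipcode"]),
    ],
    remainder="drop",  # any column not listed above is discarded
)

out = pre_demo.fit_transform(toy)
# Densify in case the transformer returned a sparse matrix
out = out.toarray() if hasattr(out, "toarray") else out

print(out.shape)  # 2 scaled columns + 2 dummy columns
print(out)
```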

## 4. Modeling
**Instantiate Regression algorithm**

The following regression algorithms are considered:

1. Random Forest Regression;
2. Extra Tree Regression;
3. XGBRegressor;
4. Artificial Neural Networks.

GridSearchCV is applied with 5 cross-validation folds (the default setting) to fine-tune the hyperparameters of the first three models. 

In [None]:
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from sklearn.ensemble import GradientBoostingRegressor
from xgboost import XGBRegressor
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, r2_score

from sklearn import set_config
set_config(display='diagram',)

### **4.1 Random Forest Regression**



In [None]:
#rf_reg = RandomForestRegressor(n_jobs=-1,random_state= 1703,criterion= 'mse')
#rf_params = {'max_depth': [15,17,20,22,25],
#             'max_features':[18,20,22,25,30,35],             
#             'n_estimators': [300,400,500,700,900]}
#rf_gridsearch = GridSearchCV(estimator=rf_reg,
#                              param_grid=rf_params,
#                              cv=5,
#                              return_train_score=True)

**Create pipeline**

After instantiating GridSearchCV, a pipeline is created. This pipeline performs the data transformation (one-hot encoding and feature scaling) and the GridSearchCV in a single step.
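
For comparison, a common alternative arrangement (not the one used in this notebook) wraps the whole pipeline inside GridSearchCV, so that the preprocessing is refitted on the training part of every fold. The sketch below runs on small synthetic data with a deliberately tiny grid; all names and values are hypothetical. Hyperparameters are addressed with the step name as a prefix, e.g. `m__max_depth`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Synthetic data, for illustration only
rng = np.random.default_rng(0)
n = 120
X_toy = pd.DataFrame({
    "sqft": rng.uniform(500, 4000, n),
    "zipcode": rng.choice([98001, 98004, 98039], n),
})
y_toy = np.log(100_000 + 150 * X_toy["sqft"] + rng.normal(0, 10_000, n))

pre = ColumnTransformer([
    ("num", MinMaxScaler(), ["sqft"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["zipcode"]),
])
pipe = Pipeline([("preprocessor", pre),
                 ("m", RandomForestRegressor(random_state=0))])

# The "m__" prefix routes each parameter to the regressor step
grid = GridSearchCV(pipe,
                    param_grid={"m__max_depth": [3, 5],
                                "m__n_estimators": [20, 40]},
                    cv=3)
grid.fit(X_toy, y_toy)
print(grid.best_params_)
```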

In [None]:
#pipeline = Pipeline(steps=[('preprocessor', preprocessor),
#                           ('m', rf_gridsearch)])

**Fit model**

The model is fit on the training set.

In [None]:
#%%time
#model = pipeline.fit(X_train, y_train)
#model

These are the best hyperparameters selected for the Random Forest Model:

In [None]:
#model['m'].best_params_

The cells above have been commented out, as running them would take too much time. However, the optimal hyperparameters selected are:

- max_depth: 25,
- max_features: 35,             
- n_estimators: 900

In [None]:
n_estimators = 900 #model['m'].best_params_['n_estimators']
max_depth = 25 #model['m'].best_params_['max_depth']
max_features = 35 #model['m'].best_params_['max_features']

The Model is now fitted using the Selected Hyperparameters.

In [None]:
%%time

from sklearn.model_selection import cross_val_score

rf_opt = RandomForestRegressor(n_estimators=n_estimators,
                               max_depth=max_depth,
                               max_features=max_features,
                               n_jobs=-1,
                               random_state= 1703,
                               criterion= 'mse')

best_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                                ('m', rf_opt)])

print(cross_val_score(best_pipeline,X_train, y_train,cv=5))

best_model = best_pipeline.fit(X_train, y_train)


#dump(best_model, 'rf_1_best_model.joblib') 
#best_model = load('rf_1_best_model.joblib') 

According to the results from the cross validation analysis we expect the Random Forest model to achieve an R2 score on the validation set around 0.89. 

This result is confirmed in the next cell:

In [None]:
#Train Score: 
print(f'Score on Training set: {best_model.score(X_train, y_train)}')
#Validation Score:
print(f'Score on Validation set: {best_model.score(X_valid, y_valid)}')

The model seems to overfit the training data. This is also confirmed by the following graphs, where the model visibly performs better on the training dataset than on the validation set. This is clear from the predictions of the price both in logs and in levels:

In [None]:
y_hat_train = best_model.predict(X_train)
y_hat_valid = best_model.predict(X_valid)

y_hat_train_lev = np.exp(y_hat_train)
y_hat_valid_lev = np.exp(y_hat_valid)

In [None]:
fig, ax = plt.subplots(1,2,figsize=(11,5), sharey=True, sharex=True)

ax[0].scatter(y_hat_train,y_train)
ax[1].scatter(y_hat_valid,y_valid, c='r')
ax[0].set_title('Train Dataset')
ax[1].set_title('Validation Dataset');

In [None]:
fig, ax = plt.subplots(1,2,figsize=(11,5), sharey=True, sharex=True)

ax[0].scatter(y_hat_train_lev,y_train_lev)
ax[1].scatter(y_hat_valid_lev,y_valid_lev, c='r')
ax[0].set_title('Train Dataset')
ax[1].set_title('Validation Dataset');

A DataFrame to store the results is created:

In [None]:
index = ['RandomForest','ExtraTree','XGBRegressor']
col = ['R2 Train', 'RMSE Train','R2 Valid', 'RMSE Valid']

results_df_log = pd.DataFrame(index=index, columns=col)
results_df_lev = pd.DataFrame(index=index, columns=col)

The two key metrics (R2 and RMSE) are now calculated for the price in level and in log and then stored:

In [None]:
from sklearn.metrics import mean_squared_error, r2_score

mse_train_rf = mean_squared_error(y_train, y_hat_train, squared=False)
mse_valid_rf = mean_squared_error(y_valid, y_hat_valid, squared=False)

r2_train_rf = r2_score(y_train, y_hat_train)
r2_valid_rf = r2_score(y_valid, y_hat_valid)

print(f'RMSE Score on Training set: {mse_train_rf}')
print(f'RMSE Score on Validation set: {mse_valid_rf}')
print('\n')
print(f'R2 Score on Training set: {r2_train_rf}')
print(f'R2 Score on Validation set: {r2_valid_rf}')
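
For reference, the two metrics can be written out by hand. The sketch below uses hypothetical prices and predictions (not results from this notebook) and computes both metrics on the log scale and in levels. With `squared=False`, `mean_squared_error` returns the root of the MSE, which is the quantity that `rmse` below reproduces.

```python
import numpy as np

# Hypothetical log-prices and predictions, for illustration only
y_true_log = np.log(np.array([300_000.0, 450_000.0, 600_000.0, 1_200_000.0]))
y_pred_log = np.log(np.array([320_000.0, 430_000.0, 650_000.0, 1_000_000.0]))

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y - y_hat)^2))."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1 - ss_res / ss_tot)

# Metrics on the log scale
rmse_log, r2_log = rmse(y_true_log, y_pred_log), r2(y_true_log, y_pred_log)

# Metrics in levels (USD), after undoing the log with exp
y_true_usd, y_pred_usd = np.exp(y_true_log), np.exp(y_pred_log)
rmse_usd, r2_usd = rmse(y_true_usd, y_pred_usd), r2(y_true_usd, y_pred_usd)

print(f"log scale: RMSE={rmse_log:.4f}, R2={r2_log:.4f}")
print(f"levels:    RMSE={rmse_usd:,.0f} USD, R2={r2_usd:.4f}")
```

The RMSE in levels is dominated by the most expensive homes, which is one reason why the two scales can rank models differently.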

In [None]:
results_df_log.loc['RandomForest','R2 Train'] = r2_train_rf
results_df_log.loc['RandomForest','R2 Valid'] = r2_valid_rf
results_df_log.loc['RandomForest','RMSE Train'] = mse_train_rf
results_df_log.loc['RandomForest','RMSE Valid'] = mse_valid_rf

In [None]:
mse_train_rf = mean_squared_error(y_train_lev, y_hat_train_lev, squared=False)
mse_valid_rf = mean_squared_error(y_valid_lev, y_hat_valid_lev, squared=False)

r2_train_rf = r2_score(y_train_lev, y_hat_train_lev)
r2_valid_rf = r2_score(y_valid_lev, y_hat_valid_lev)

print(f'RMSE Score on Training set: {mse_train_rf}')
print(f'RMSE Score on Valid set: {mse_valid_rf}')
print('\n')
print(f'R2 Score on Training set: {r2_train_rf}')
print(f'R2 Score on Valid set: {r2_valid_rf}')

In [None]:
index = ['RandomForest','ExtraTree','XGBRegressor','Art Neural Net']
col = ['R2 Train', 'RMSE Train','R2 Valid', 'RMSE Valid']

results_df_lev.loc['RandomForest','R2 Train'] = r2_train_rf
results_df_lev.loc['RandomForest','R2 Valid'] = r2_valid_rf
results_df_lev.loc['RandomForest','RMSE Train'] = mse_train_rf
results_df_lev.loc['RandomForest','RMSE Valid'] = mse_valid_rf

On the validation set, the root mean squared error of the RandomForest model in evaluating the price of a property is 134K. The average price in the sample is 545k USD, meaning an evaluation error of 23.5%. 

### 4.2 **Extra Tree Regressor**

The same approach used for the Random Forest Model is now used for the ExtraTrees:

In [None]:
#et_reg = ExtraTreesRegressor(n_jobs=-1,random_state= 1703,criterion= 'mse')
#et_params = {'max_depth': [15,17,20,22,25],
#             'max_features':[18,20,22,25,30,35],             
#             'n_estimators': [300,400,500,700,900]}
#et_gridsearch = GridSearchCV(estimator=et_reg,
#                             param_grid=et_params,
#                             cv=5,
#                             return_train_score=True)

In [None]:
#pipeline = Pipeline(steps=[('preprocessor', preprocessor),
#                           ('m', et_gridsearch)])

In [None]:
#model = pipeline.fit(X_train, y_train)
#model

In [None]:
#model['m'].best_params_

The cells above have been commented out, as running them would take too much time. However, the optimal hyperparameters selected are:

- max_depth: 25,
- max_features: 35,             
- n_estimators: 900

In [None]:
n_estimators = 900 #model['m'].best_params_['n_estimators']
max_depth = 25 #model['m'].best_params_['max_depth']
max_features = 35 #model['m'].best_params_['max_features']

In [None]:
%%time

from sklearn.model_selection import cross_val_score

et_opt = ExtraTreesRegressor(n_estimators=n_estimators,
                               max_depth=max_depth,
                               max_features=max_features,
                               n_jobs=-1,
                               random_state= 1703,
                               criterion= 'mse')

et_best_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                                ('m', et_opt)])

print(cross_val_score(et_best_pipeline,X_train, y_train,cv=5))

et_best_model = et_best_pipeline.fit(X_train, y_train)

According to the results from the cross validation analysis we expect the ExtraTree model to achieve an R2 score on the validation set around 0.89. 

This result is confirmed in the next cell:

In [None]:
#Train Score: 
print(f'Score on Training set: {et_best_model.score(X_train, y_train)}')
#Validation Score:
print(f'Score on Validation set: {et_best_model.score(X_valid, y_valid)}')

Again, the model seems to overfit the training data. Although it achieves a slight improvement on the validation dataset, the gap with the training dataset is even wider. This is also confirmed by the following graphs, where the model visibly performs almost perfectly on the training dataset, unlike on the validation set. This is clear from the predictions of the price both in logs and in levels:

In [None]:
y_hat_train = et_best_model.predict(X_train)
y_hat_valid = et_best_model.predict(X_valid)

y_hat_train_lev = np.exp(y_hat_train)
y_hat_valid_lev = np.exp(y_hat_valid)

In [None]:
fig, ax = plt.subplots(1,2,figsize=(11,5), sharey=True, sharex=True)

ax[0].scatter(y_hat_train,y_train)
ax[1].scatter(y_hat_valid,y_valid, c='r')
ax[0].set_title('Train Dataset')
ax[1].set_title('Validation Dataset')

plt.suptitle('ExtraTree Regressor');

In [None]:
mse_train_et = mean_squared_error(y_train, y_hat_train, squared=False)
mse_valid_et = mean_squared_error(y_valid, y_hat_valid, squared=False)

r2_train_et = r2_score(y_train, y_hat_train)
r2_valid_et = r2_score(y_valid, y_hat_valid)

print(f'RMSE Score on Training set: {mse_train_et}')
print(f'RMSE Score on Validation set: {mse_valid_et}')
print('\n')
print(f'R2 Score on Training set: {r2_train_et}')
print(f'R2 Score on Validation set: {r2_valid_et}')

In [None]:
results_df_log.loc['ExtraTree','R2 Train'] = r2_train_et
results_df_log.loc['ExtraTree','R2 Valid'] = r2_valid_et
results_df_log.loc['ExtraTree','RMSE Train'] = mse_train_et
results_df_log.loc['ExtraTree','RMSE Valid'] = mse_valid_et

In [None]:
mse_train_et = mean_squared_error(y_train_lev, y_hat_train_lev, squared=False)
mse_valid_et = mean_squared_error(y_valid_lev, y_hat_valid_lev, squared=False)

r2_train_et = r2_score(y_train_lev, y_hat_train_lev)
r2_valid_et = r2_score(y_valid_lev, y_hat_valid_lev)

print(f'RMSE Score on Training set: {mse_train_et}')
print(f'RMSE Score on Valid set: {mse_valid_et}')
print('\n')
print(f'R2 Score on Training set: {r2_train_et}')
print(f'R2 Score on Valid set: {r2_valid_et}')

In [None]:
results_df_lev.loc['ExtraTree','R2 Train'] = r2_train_et
results_df_lev.loc['ExtraTree','R2 Valid'] = r2_valid_et
results_df_lev.loc['ExtraTree','RMSE Train'] = mse_train_et
results_df_lev.loc['ExtraTree','RMSE Valid'] = mse_valid_et

In [None]:
results_df_lev

On the validation set, the root mean squared error of the ExtraTrees model in evaluating the price of a property is 121'088, slightly better than the RandomForest. The average price in the sample is 545k USD, meaning an evaluation error of 22.2%.

### 4.3 **XGBRegressor**

The same approach used for the Random Forest/Extra Tree Model is now used for the XGBRegressor Model:

In [None]:
# hyper-parameters to tune
#xgb1 = XGBRegressor(nthread=4,subsample=0.9,colsample_bytree=0.7,min_child_weight=4,silent=1,objective='reg:squarederror',verbosity=0)

#xg_param = {'learning_rate': [0.01, 0.03, 0.05, 0.1],
#              'max_depth': [7, 8, 9, 10],
#              'n_estimators': [200, 300, 500, 700, 900]}

#Xb_gridsearch = GridSearchCV(estimator=xgb1,
#                              param_grid=xg_param,
#                              cv=5,
#                              return_train_score=True) 

In [None]:
#xg_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
#                              ('m', Xb_gridsearch)])

In [None]:
#%%time
#xg_best_model = xg_pipeline.fit(X_train, y_train)
#xg_best_model

In [None]:
#xg_best_model['m'].best_params_

The cells above have been commented out, as running them would take too much time. However, the optimal hyperparameters selected are:

- Learning Rate: 0.03,
- max_depth:8,             
- n_estimators: 700

In [None]:
learn_rate = 0.03 #xg_best_model['m'].best_params_.get('learning_rate')
n_est = 700 #xg_best_model['m'].best_params_.get('n_estimators')
tree_md = 8 #xg_best_model['m'].best_params_.get('max_depth')

In [None]:
%%time
from sklearn.model_selection import cross_val_score
# Various hyper-parameters to tune
xgb_opt = XGBRegressor(learning_rate=learn_rate,
                       n_estimators=n_est,
                       max_depth=tree_md,
                       nthread=4,
                       subsample=0.9,
                       colsample_bytree=0.7,
                       min_child_weight=4,
                       objective='reg:squarederror')

best_xg_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                                    ('m', xgb_opt)])

print(cross_val_score(best_xg_pipeline,X_train, y_train,cv=5))

best_xg_model = best_xg_pipeline.fit(X_train, y_train)
best_xg_model

In [None]:
#Train Score: 
print(f'Score on Training set: {best_xg_model.score(X_train, y_train)}')
#Validation Score:
print(f'Score on Validation set: {best_xg_model.score(X_valid, y_valid)}')

The overfitting affecting the former models is partially reduced in the XGBRegressor. The cross-validation results and the score achieved on the validation set mark a significant improvement over the preceding two models. In general, we should expect an R2 score on unseen data of around 0.91, an improvement of about 2% over the two preceding models.

In [None]:
y_hat_train = best_xg_model.predict(X_train)
y_hat_valid = best_xg_model.predict(X_valid)

y_hat_train_lev = np.exp(y_hat_train)
y_hat_valid_lev = np.exp(y_hat_valid)

In [None]:
fig, ax = plt.subplots(1,2,figsize=(11,5), sharey=True, sharex=True)

ax[0].scatter(y_hat_train,y_train)
ax[1].scatter(y_hat_valid,y_valid, c='r')
ax[0].set_title('Train Dataset')
ax[1].set_title('Validation Dataset')

plt.suptitle('XGBRegressor');

In [None]:
fig, ax = plt.subplots(1,2,figsize=(11,5), sharey=True, sharex=True)

ax[0].scatter(y_hat_train_lev,y_train_lev)
ax[1].scatter(y_hat_valid_lev,y_valid_lev, c='r')
ax[0].set_title('Train Dataset')
ax[1].set_title('Validation Dataset');

plt.suptitle('XGBRegressor');

In [None]:
mse_train_xgb = mean_squared_error(y_train, y_hat_train, squared=False)
mse_valid_xgb = mean_squared_error(y_valid, y_hat_valid, squared=False)

r2_train_xgb = r2_score(y_train, y_hat_train)
r2_valid_xgb = r2_score(y_valid, y_hat_valid)

print(f'RMSE Score on Training set: {mse_train_xgb}')
print(f'RMSE Score on Validation set: {mse_valid_xgb}')
print('\n')
print(f'R2 Score on Training set: {r2_train_xgb}')
print(f'R2 Score on Validation set: {r2_valid_xgb}')

In [None]:
results_df_log.loc['XGBRegressor','R2 Train'] = r2_train_xgb
results_df_log.loc['XGBRegressor','R2 Valid'] = r2_valid_xgb
results_df_log.loc['XGBRegressor','RMSE Train'] = mse_train_xgb
results_df_log.loc['XGBRegressor','RMSE Valid'] = mse_valid_xgb

In [None]:
mse_train_xgb = mean_squared_error(y_train_lev, y_hat_train_lev, squared=False)
mse_valid_xgb = mean_squared_error(y_valid_lev, y_hat_valid_lev, squared=False)

r2_train_xgb = r2_score(y_train_lev, y_hat_train_lev)
r2_valid_xgb = r2_score(y_valid_lev, y_hat_valid_lev)

print(f'RMSE Score on Training set: {mse_train_xgb}')
print(f'RMSE Score on Valid set: {mse_valid_xgb}')
print('\n')
print(f'R2 Score on Training set: {r2_train_xgb}')
print(f'R2 Score on Valid set: {r2_valid_xgb}')

In [None]:
results_df_lev.loc['XGBRegressor','R2 Train'] = r2_train_xgb
results_df_lev.loc['XGBRegressor','R2 Valid'] = r2_valid_xgb
results_df_lev.loc['XGBRegressor','RMSE Train'] = mse_train_xgb
results_df_lev.loc['XGBRegressor','RMSE Valid'] = mse_valid_xgb

In [None]:
results_df_lev

On the validation set, the root mean squared error of the XGBRegressor model in evaluating the price of a property is 114K, the lowest value scored so far. The average price in the sample is 545k USD, meaning an evaluation error of 20.9%, about 1.3 percentage points lower than the ExtraTrees model.

## Conclusion

XGBRegressor is the model delivering the best results on the validation dataset. The table below summarizes the overall results using the target feature in logs:

In [None]:
results_df_log

And in levels:

In [None]:
results_df_lev

XGBRegressor comes out on top, followed by the ExtraTreesRegressor. Its root mean squared error is 114k USD on the validation set. The model is now evaluated on the test dataset:

In [None]:
y_hat_test = best_xg_model.predict(X_test)
y_hat_test_lev = np.exp(y_hat_test)

In [None]:
mse_test_xgb = mean_squared_error(y_test, y_hat_test, squared=False)
r2_test_xgb = r2_score(y_test, y_hat_test)

print(f'RMSE Score on Test set: {mse_test_xgb}')
print('\n')
print(f'R2 Score on Test set: {r2_test_xgb}')

In [None]:
mse_test_xgb = mean_squared_error(y_test_lev, y_hat_test_lev, squared=False)
r2_test_xgb = r2_score(y_test_lev, y_hat_test_lev)

print(f'RMSE Score on Test set: {mse_test_xgb}')
print('\n')
print(f'R2 Score on Test set: {r2_test_xgb}')

In [None]:
fig, ax = plt.subplots(1,1,figsize=(6,5), sharey=True, sharex=True)

ax.scatter(y_hat_test_lev,y_test_lev);

The results on the test dataset confirm those obtained on the validation set. The root mean squared error on the test set is 103K and the R2 is 0.919.  

### Save prediction in a csv file 

In [None]:
prediction = pd.DataFrame(index=y_test.index,columns=['Real Value','Prediction'])

prediction['Real Value'] = y_test_lev
prediction['Prediction'] = y_hat_test_lev


#Convert DataFrame to a csv file that can be uploaded
#This is saved in the same directory as your notebook
filename = 'King County House Prediction.csv'

prediction.to_csv(filename,index=False)

print('Saved file: ' + filename)