<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 2 - Singapore Housing Data and Kaggle Challenge - PART 3

---
# PART 3 - Modelling


In this notebook:<br>

* **Benchmark Modelling<br>**
Floor area shows the most direct positive correlation with resale price, so a single-feature benchmark model was fitted on it as a preliminary gauge of the R-square and RMSE that later models should improve on.

* **Pre-modelling Preparation<br>**
    * The cleaned train dataset was split via train-test-split. The numerical and categorical features were then separated respectively within the train and test set.
    * One hot encoding was done to the categorical features to convert these categorical data variables into numbers that can be applied to train our regression model.
    * Standard scaling was applied to our numerical features so that they are on a comparable scale before training our ridge and lasso models. This matters because ridge and lasso penalise the magnitude of each coefficient: a feature measured on a large scale needs only a small coefficient and is therefore penalised less than an equally important feature measured on a small scale.
    * The transformed numerical and categorical features are then combined again for train and test set respectively.


* **Models and Model Selection**
    * Linear, ridge and lasso regression were fitted on the train set and evaluated on the holdout test set using R-square, cross-val and RMSE scores.
<br>

* **Conclusion**
    * Overall conclusion of findings from the EDA and model
<br>
* **Kaggle Export<br>**
    * The final cleaned and transformed test.csv data was brought in and prepared in the format for kaggle submission.

---

Import required libraries:

In [1]:
import pandas as pd
from scipy.stats import ttest_ind
import numpy as np
from sklearn import metrics
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge, Lasso, ElasticNet, LinearRegression, RidgeCV, LassoCV, ElasticNetCV
from sklearn.model_selection import cross_val_score, train_test_split

import seaborn               as sns
import matplotlib.pyplot     as plt
import matplotlib.ticker as mticker

import statsmodels.api as sm
import pickle
import joblib

%matplotlib inline

Import train (selected features) dataset:

In [2]:
train = pickle.load(open('./pkl/train_sel.pkl', 'rb'))

In [3]:
print(train.shape)
train.head()

(150634, 43)


Unnamed: 0,floor_area_sqm,resale_price,tranc_year,tranc_month,mid_storey,hdb_age,max_floor_lvl,total_dwelling_units,1room_sold,2room_sold,...,mrt_interchange,pri_sch_affiliation,affiliation,town,street_name,full_flat_type,planning_area,mrt_name,pri_sch_name,sec_sch_name
0,90.0,680000.0,2016,5,11,15,25,142,0,0,...,0,1,0,KALLANG/WHAMPOA,UPP BOON KENG RD,4 ROOM Model A,Kallang,Kallang,Geylang Methodist School,Geylang Methodist School
1,130.0,665000.0,2012,7,8,34,9,112,0,0,...,1,1,0,BISHAN,BISHAN ST 13,5 ROOM Improved,Bishan,Bishan,Kuo Chuan Presbyterian Primary School,Kuo Chuan Presbyterian Secondary School
2,144.0,838000.0,2013,7,14,24,16,90,0,0,...,0,0,0,BUKIT BATOK,BT BATOK ST 25,EXECUTIVE Apartment,Bukit Batok,Bukit Batok,Keming Primary School,Yusof Ishak Secondary School
3,103.0,550000.0,2012,4,3,29,11,75,0,0,...,1,1,1,BISHAN,BISHAN ST 22,4 ROOM Model A,Bishan,Bishan,Catholic High School,Catholic High School
4,83.0,298000.0,2017,12,2,34,4,48,0,0,...,0,0,0,YISHUN,YISHUN ST 81,4 ROOM Simplified,Yishun,Khatib,Naval Base Primary School,Orchid Park Secondary School


---
# 3.0 Benchmark Modelling

Floor area shows the most direct positive correlation with resale price, so I created this single-feature benchmark model to gauge the R-square and RMSE that a minimal model achieves. <br>
As a further step, should time permit, it would be interesting to add features to the model one at a time, to see more clearly which features carry the most weight in the prediction and which introduce more error.

In [4]:
#Extract just the 2 required variables
train_benchmark = train[['resale_price', 'floor_area_sqm']]
train_benchmark.head()

Unnamed: 0,resale_price,floor_area_sqm
0,680000.0,90.0
1,665000.0,130.0
2,838000.0,144.0
3,550000.0,103.0
4,298000.0,83.0


Define X predictor variables, and y target variable:

In [5]:
#define X predictor variables, and y target variable
X_ben = train_benchmark.drop(columns='resale_price')
y_ben = train_benchmark['resale_price']

Train test split:

In [6]:
X_ben_train, X_ben_test, y_ben_train, y_ben_test = train_test_split(X_ben, y_ben, test_size=0.3, random_state=42)

Instantiate and fit the model:

In [7]:
lr = LinearRegression()
lr.fit(X_ben_train, y_ben_train);

Model evaluation:

In [8]:
# Train score
lr.score(X_ben_train, y_ben_train)

0.4264526803114437

In [9]:
# Test score
lr.score(X_ben_test, y_ben_test)

0.4311786273004551

In [10]:
# Check the RMSE on the testing set
y_ben_preds = lr.predict(X_ben_test)

np.sqrt(metrics.mean_squared_error(y_ben_test, y_ben_preds))

107444.37189875565
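
The two metrics used throughout this notebook can be reproduced by hand, which helps when reading the scores above. The sketch below uses a handful of made-up prices (hypothetical values, not taken from the HDB dataset) and computes RMSE and R-square directly from their definitions; `lr.score` reports the same R-square quantity for a regressor.

```python
import numpy as np

# Hypothetical toy values, not taken from the HDB dataset
y_true = np.array([300_000.0, 450_000.0, 520_000.0, 610_000.0])
y_pred = np.array([320_000.0, 430_000.0, 540_000.0, 600_000.0])

residuals = y_true - y_pred

# RMSE: square root of the mean squared residual, in the same unit as y
rmse = np.sqrt(np.mean(residuals ** 2))

# R-square: 1 - (residual sum of squares / total sum of squares around the mean)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(f"RMSE: {rmse:.2f}")
print(f"R-square: {r2:.4f}")
```

Because RMSE is expressed in dollars, it is the easier of the two to compare against actual resale prices.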

---
# 3.1 Pre-modelling Preparation

## 3.1.1 Dataframe Split, OHE

Define X predictor variables, and y target variable:

In [11]:
#define X predictor variables, and y target variable
X = train.drop(columns='resale_price')
y = train['resale_price']

In [83]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [84]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(105443, 42)
(45191, 42)
(105443,)
(45191,)


---
### X_train prep
Do the following for X_train_**num**:
* Create X_train_num variable
* Scale

In [14]:
Cat_feat = ['town', 'street_name', 'full_flat_type', 'planning_area', 'mrt_name',
       'pri_sch_name', 'sec_sch_name']

Bin_feat = ['commercial', 'market_hawker', 'multistorey_carpark',
       'precinct_pavilion', 'bus_interchange', 'mrt_interchange',
       'pri_sch_affiliation', 'affiliation']

In [15]:
# Drop cat values to get X_train_num
columns_cat = Cat_feat + Bin_feat
X_train_num = X_train.drop(columns=columns_cat, inplace=False)

In [16]:
print(X_train_num.shape)
X_train_num.head()

(105443, 27)


Unnamed: 0,floor_area_sqm,tranc_year,tranc_month,mid_storey,hdb_age,max_floor_lvl,total_dwelling_units,1room_sold,2room_sold,3room_sold,...,hawker_nearest_distance,hawker_within_2km,hawker_food_stalls,hawker_market_stalls,mrt_nearest_distance,bus_stop_nearest_distance,pri_sch_nearest_distance,vacancy,sec_sch_nearest_dist,cutoff_point
9509,66.0,2018,4,14,47,16,197,0,0,195,...,179.75505,12.0,80,12,811.746831,60.182754,275.483127,48,505.546279,188
72510,67.0,2015,11,11,38,13,178,0,0,143,...,353.35574,4.0,56,169,411.382755,95.161984,440.333626,33,768.380362,223
79362,120.0,2017,11,2,22,12,114,0,0,0,...,291.060448,1.0,43,0,296.236638,108.155684,336.494755,61,301.224943,211
120303,91.0,2013,12,2,36,4,16,0,0,8,...,658.934326,2.0,70,0,27.345357,77.509094,161.074038,54,426.452481,188
66702,74.0,2013,5,5,44,14,106,0,0,100,...,79.331964,1.0,28,52,1524.200011,91.153489,1880.19066,69,452.610469,237


---
Do the following for X_train_**cat**:
* Create X_train_cat variable
* OHE

In [17]:
# Get X_train_cat with specific columns
X_train_cat = X_train[columns_cat]
X_train_cat.head()

Unnamed: 0,town,street_name,full_flat_type,planning_area,mrt_name,pri_sch_name,sec_sch_name,commercial,market_hawker,multistorey_carpark,precinct_pavilion,bus_interchange,mrt_interchange,pri_sch_affiliation,affiliation
9509,KALLANG/WHAMPOA,WHAMPOA DR,3 ROOM Improved,Novena,Boon Keng,Hong Wen School,Bendemeer Secondary School,1,0,0,0,0,0,0,0
72510,JURONG EAST,JURONG EAST ST 32,3 ROOM New Generation,Jurong East,Chinese Garden,Jurong Primary School,Bukit Batok Secondary School,1,0,0,0,0,0,0,0
79362,WOODLANDS,WOODLANDS RING RD,5 ROOM Improved,Woodlands,Admiralty,Greenwood Primary School,Woodlands Ring Secondary School,0,0,0,0,0,0,0,0
120303,WOODLANDS,WOODLANDS ST 13,4 ROOM New Generation,Woodlands,Marsiling,Marsiling Primary School,Fuchun Secondary School,0,0,0,0,0,0,0,0
66702,JURONG EAST,TEBAN GDNS RD,3 ROOM Improved,Jurong East,Jurong East,Qifa Primary School,Commonwealth Secondary School,0,0,0,0,1,1,0,0


Perform get dummies for categorical features:

In [18]:
# One-hot encode the categorical columns of the training data
X_train_cat = pd.get_dummies(X_train_cat[columns_cat])

In [19]:
print(X_train_cat.shape)
X_train_cat.head()

(105443, 1060)


Unnamed: 0,commercial,market_hawker,multistorey_carpark,precinct_pavilion,bus_interchange,mrt_interchange,pri_sch_affiliation,affiliation,town_ANG MO KIO,town_BEDOK,...,sec_sch_name_Xinmin Secondary School,sec_sch_name_Yio Chu Kang Secondary School,sec_sch_name_Yishun Secondary School,sec_sch_name_Yishun Town Secondary School,sec_sch_name_Yuan Ching Secondary School,sec_sch_name_Yuhua Secondary School,sec_sch_name_Yusof Ishak Secondary School,sec_sch_name_Yuying Secondary School,sec_sch_name_Zhenghua Secondary School,sec_sch_name_Zhonghua Secondary School
9509,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
72510,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
79362,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
120303,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
66702,0,0,0,0,1,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


---
### X_test prep
Do the following for X_test_**num**:
* Create X_test_num variable
* Scale (transform only, using the scaler fitted on the train set)

In [20]:
# Drop cat values to get X_test_num
columns_cat = Cat_feat + Bin_feat
X_test_num = X_test.drop(columns=columns_cat, inplace=False)

print(X_test_num.shape)
X_test_num.head()

(45191, 27)


Unnamed: 0,floor_area_sqm,tranc_year,tranc_month,mid_storey,hdb_age,max_floor_lvl,total_dwelling_units,1room_sold,2room_sold,3room_sold,...,hawker_nearest_distance,hawker_within_2km,hawker_food_stalls,hawker_market_stalls,mrt_nearest_distance,bus_stop_nearest_distance,pri_sch_nearest_distance,vacancy,sec_sch_nearest_dist,cutoff_point
107690,90.0,2016,4,8,21,15,112,0,0,0,...,2645.572861,0.0,40,0,281.338938,59.202105,129.178432,72,372.278159,210
100411,122.0,2014,4,8,22,16,59,0,0,0,...,2265.83515,0.0,40,0,503.770173,126.042277,171.932686,87,278.795453,199
23295,145.0,2016,10,11,28,13,48,0,0,0,...,241.932595,9.0,28,32,256.724183,78.68339,182.181845,75,598.682957,211
68880,65.0,2020,10,8,45,16,177,0,0,174,...,428.259337,6.0,55,143,1833.14245,183.94154,179.059765,34,872.746957,225
88677,125.0,2017,5,11,28,13,72,0,0,0,...,1123.038855,2.0,42,0,894.66728,207.588401,682.089893,71,399.321648,229


---
Do the following for X_test_**cat**:
* Create X_test_cat variable
* OHE

In [21]:
# Get X_test_cat with specific columns
X_test_cat = X_test[columns_cat]
X_test_cat.head()

Unnamed: 0,town,street_name,full_flat_type,planning_area,mrt_name,pri_sch_name,sec_sch_name,commercial,market_hawker,multistorey_carpark,precinct_pavilion,bus_interchange,mrt_interchange,pri_sch_affiliation,affiliation
107690,SENGKANG,COMPASSVALE RD,4 ROOM Model A,Sengkang,Sengkang,Compassvale Primary School,Compassvale Secondary School,0,0,0,0,1,1,0,0
100411,SENGKANG,COMPASSVALE WALK,5 ROOM Improved,Sengkang,Sengkang,Seng Kang Primary School,Seng Kang Secondary School,0,0,0,0,1,1,0,0
23295,TOA PAYOH,LOR 2 TOA PAYOH,EXECUTIVE Apartment,Toa Payoh,Braddell,Kheng Cheng School,Beatty Secondary School,0,0,0,0,0,0,0,0
68880,MARINE PARADE,MARINE DR,3 ROOM Improved,planning_area_others,Eunos,Tao Nan School,Saint Patrick's School,1,0,0,0,1,0,0,0
88677,TAMPINES,TAMPINES ST 45,5 ROOM Improved,Tampines,Tampines East,Tampines North Primary School,Dunman Secondary School,0,0,0,0,0,0,0,0


Perform get dummies for categorical features:

In [22]:
# One-hot encode the categorical columns of the test data
X_test_cat = pd.get_dummies(X_test_cat[columns_cat])

# Make sure the column names of the test data match the training data
X_test_cat = X_test_cat.reindex(columns=X_train_cat.columns, fill_value=0)

In [23]:
print(X_test_cat.shape)
X_test_cat.head()

(45191, 1060)


Unnamed: 0,commercial,market_hawker,multistorey_carpark,precinct_pavilion,bus_interchange,mrt_interchange,pri_sch_affiliation,affiliation,town_ANG MO KIO,town_BEDOK,...,sec_sch_name_Xinmin Secondary School,sec_sch_name_Yio Chu Kang Secondary School,sec_sch_name_Yishun Secondary School,sec_sch_name_Yishun Town Secondary School,sec_sch_name_Yuan Ching Secondary School,sec_sch_name_Yuhua Secondary School,sec_sch_name_Yusof Ishak Secondary School,sec_sch_name_Yuying Secondary School,sec_sch_name_Zhenghua Secondary School,sec_sch_name_Zhonghua Secondary School
107690,0,0,0,0,1,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
100411,0,0,0,0,1,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
23295,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
68880,1,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
88677,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
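
The `reindex` step above is what keeps the train and test dummy matrices compatible. A minimal sketch with a hypothetical three-row train frame and two-row test frame shows its two effects: categories missing from the test set are added as all-zero columns, and categories that appear only in the test set are dropped.

```python
import pandas as pd

# Hypothetical toy frames; SENGKANG appears only in the test data
train_cat = pd.DataFrame({"town": ["BISHAN", "YISHUN", "BEDOK"]})
test_cat = pd.DataFrame({"town": ["YISHUN", "SENGKANG"]})

train_dummies = pd.get_dummies(train_cat)
test_dummies = pd.get_dummies(test_cat)

# Force the test dummies onto the train column layout
aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(list(aligned.columns))
print(aligned.astype(int).values.tolist())
```

Note the trade-off: a row with an unseen category (SENGKANG here) ends up with zeros in every `town_` column, so the model treats it as belonging to none of the known towns.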


---
## 3.1.2 Scale Data
Standardize the numerical dataframes for train and test, then concatenate them with the train and test categorical dummy dataframes:

In [24]:
num_columns = X_train_num.columns.tolist()

In [25]:
# Initialize the StandardScaler
ss = StandardScaler()

# Scale the numerical columns and replace them in the copied DataFrame (for train num set)
X_train_num[num_columns] = ss.fit_transform(X_train_num[num_columns])
print(X_train_num.shape)

# Scale the numerical columns and replace them in the copied DataFrame (for test num set)
X_test_num[num_columns] = ss.transform(X_test_num[num_columns])
print(X_test_num.shape)
X_test_num.head()

(105443, 27)
(45191, 27)


Unnamed: 0,floor_area_sqm,tranc_year,tranc_month,mid_storey,hdb_age,max_floor_lvl,total_dwelling_units,1room_sold,2room_sold,3room_sold,...,hawker_nearest_distance,hawker_within_2km,hawker_food_stalls,hawker_market_stalls,mrt_nearest_distance,bus_stop_nearest_distance,pri_sch_nearest_distance,vacancy,sec_sch_nearest_dist,cutoff_point
107690,-0.29866,-0.17935,-0.772735,-0.051241,-0.620515,-0.022855,-0.215492,-0.022425,-0.152469,-0.57043,...,1.366059,-0.944695,-0.451424,-1.047253,-1.124229,-1.00859,-1.129642,0.929812,-0.4422,-0.009736
100411,1.012988,-0.905738,-0.772735,-0.051241,-0.538053,0.137305,-1.124674,-0.022425,-0.152469,-0.57043,...,1.013496,-0.944695,-0.451424,-1.047253,-0.606426,0.192312,-0.948291,1.767838,-0.744207,-0.558691
23295,1.955736,-0.17935,1.022014,0.491376,-0.043279,-0.343176,-1.313372,-0.022425,-0.152469,-0.57043,...,-0.865571,1.297427,-1.076292,-0.471789,-1.18153,-0.658574,-0.904817,1.097417,0.289227,0.040169
68880,-1.323386,1.273428,1.022014,-0.051241,1.35858,0.137305,0.899543,-0.022425,-0.152469,1.97217,...,-0.692578,0.550053,0.329661,1.524349,2.488254,1.232575,-0.91806,-1.193189,1.174624,0.738839
88677,1.135956,0.183845,-0.47361,0.491376,-0.043279,-0.343176,-0.901667,-0.022425,-0.152469,-0.57043,...,-0.047519,-0.446446,-0.347279,-1.047253,0.303554,1.657432,1.215648,0.873943,-0.354833,0.938459
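
The scaler is fitted on the train set only and then reused on the test set. The sketch below reproduces that idea with plain NumPy on hypothetical floor areas (the values and variable names are assumptions for illustration): the train mean and standard deviation define the transformation, and the test data is shifted and scaled with those same statistics.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical floor areas in sqm, not taken from the dataset
train_area = rng.normal(loc=95.0, scale=25.0, size=1000)
test_area = rng.normal(loc=95.0, scale=25.0, size=300)

# Statistics come from the train set only
mu = train_area.mean()
sigma = train_area.std()  # population std (ddof=0), which StandardScaler also uses

train_scaled = (train_area - mu) / sigma
test_scaled = (test_area - mu) / sigma  # reuse train statistics, no refitting

print(train_scaled.mean(), train_scaled.std())
print(test_scaled.mean(), test_scaled.std())
```

The scaled train column has mean 0 and standard deviation 1 by construction, while the test column is only approximately standardized. That is expected, and it avoids leaking information from the test set into the preprocessing.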


---
**Combine the scaled numerical variables with the unscaled categorical dummy variables**

In [26]:
# Combine X_train_num and X_train_cat along axis=1 (column-wise) to get Z_train
Z_train = pd.concat([X_train_num, X_train_cat], axis=1)

In [27]:
print(Z_train.shape)
Z_train.head()

(105443, 1087)


Unnamed: 0,floor_area_sqm,tranc_year,tranc_month,mid_storey,hdb_age,max_floor_lvl,total_dwelling_units,1room_sold,2room_sold,3room_sold,...,sec_sch_name_Xinmin Secondary School,sec_sch_name_Yio Chu Kang Secondary School,sec_sch_name_Yishun Secondary School,sec_sch_name_Yishun Town Secondary School,sec_sch_name_Yuan Ching Secondary School,sec_sch_name_Yuhua Secondary School,sec_sch_name_Yusof Ishak Secondary School,sec_sch_name_Yuying Secondary School,sec_sch_name_Zhenghua Secondary School,sec_sch_name_Zhonghua Secondary School
9509,-1.282397,0.547039,-0.772735,1.033993,1.523504,0.137305,1.24263,-0.022425,-0.152469,2.279035,...,0,0,0,0,0,0,0,0,0,0
72510,-1.241408,-0.542544,1.321139,0.491376,0.781344,-0.343176,0.916697,-0.022425,-0.152469,1.519178,...,0,0,0,0,0,0,0,0,0,0
79362,0.93101,0.183845,1.321139,-1.136476,-0.538053,-0.503336,-0.181183,-0.022425,-0.152469,-0.57043,...,0,0,0,0,0,0,0,0,0,0
120303,-0.257671,-1.268933,1.620263,-1.136476,0.616419,-1.784617,-1.862312,-0.022425,-0.152469,-0.453529,...,0,0,0,0,0,0,0,0,0,0
66702,-0.954485,-1.268933,-0.47361,-0.593858,1.276117,-0.183015,-0.318418,-0.022425,-0.152469,0.890834,...,0,0,0,0,0,0,0,0,0,0


In [28]:
# Combine X_test_num and X_test_cat along axis=1 (column-wise) to get Z_test
Z_test = pd.concat([X_test_num, X_test_cat], axis=1)

In [29]:
print(Z_test.shape)
Z_test.head()

(45191, 1087)


Unnamed: 0,floor_area_sqm,tranc_year,tranc_month,mid_storey,hdb_age,max_floor_lvl,total_dwelling_units,1room_sold,2room_sold,3room_sold,...,sec_sch_name_Xinmin Secondary School,sec_sch_name_Yio Chu Kang Secondary School,sec_sch_name_Yishun Secondary School,sec_sch_name_Yishun Town Secondary School,sec_sch_name_Yuan Ching Secondary School,sec_sch_name_Yuhua Secondary School,sec_sch_name_Yusof Ishak Secondary School,sec_sch_name_Yuying Secondary School,sec_sch_name_Zhenghua Secondary School,sec_sch_name_Zhonghua Secondary School
107690,-0.29866,-0.17935,-0.772735,-0.051241,-0.620515,-0.022855,-0.215492,-0.022425,-0.152469,-0.57043,...,0,0,0,0,0,0,0,0,0,0
100411,1.012988,-0.905738,-0.772735,-0.051241,-0.538053,0.137305,-1.124674,-0.022425,-0.152469,-0.57043,...,0,0,0,0,0,0,0,0,0,0
23295,1.955736,-0.17935,1.022014,0.491376,-0.043279,-0.343176,-1.313372,-0.022425,-0.152469,-0.57043,...,0,0,0,0,0,0,0,0,0,0
68880,-1.323386,1.273428,1.022014,-0.051241,1.35858,0.137305,0.899543,-0.022425,-0.152469,1.97217,...,0,0,0,0,0,0,0,0,0,0
88677,1.135956,0.183845,-0.47361,0.491376,-0.043279,-0.343176,-0.901667,-0.022425,-0.152469,-0.57043,...,0,0,0,0,0,0,0,0,0,0


---
# 3.2 Models

## 3.2.0 Baseline Model (Mean Prediction)
We create a baseline model that always predicts the mean resale price (y_bar).

In [88]:
#Calculate the mean to get our y_bar values
y_bar = np.mean(y_test) 

# Create an array of mean values with the same shape as y
y_pred_baseline_mean = np.full_like(y_test, y_bar) 

In [89]:
# Calculate the RMSE for the baseline mean y model
rmse_baseline_mean = np.sqrt(mean_squared_error(y_test, y_pred_baseline_mean))
rmse_baseline_mean

142460.91138205238

As expected, the baseline model, which always predicts the mean of y, performs badly with an RMSE of about 142,461.
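
A mean-only baseline has a useful property: when the mean is taken from the same values it is scored on, its RMSE equals the (population) standard deviation of those values. The sketch below, on hypothetical prices, shows this and also the stricter variant in which the mean comes from the train set, as a deployed model would have to do. The array names are assumptions for illustration.

```python
import numpy as np

# Hypothetical toy prices, not taken from the dataset
y_train_toy = np.array([280_000.0, 350_000.0, 410_000.0, 520_000.0, 640_000.0])
y_test_toy = np.array([300_000.0, 390_000.0, 450_000.0, 700_000.0])

# Variant 1: mean taken from the train set, scored on the test set
y_bar_train = y_train_toy.mean()
baseline_preds = np.full(y_test_toy.shape, y_bar_train)
rmse_baseline = np.sqrt(np.mean((y_test_toy - baseline_preds) ** 2))

# Variant 2: mean taken from the test set itself
rmse_test_mean = np.sqrt(np.mean((y_test_toy - y_test_toy.mean()) ** 2))

print(rmse_baseline, rmse_test_mean, y_test_toy.std())
```

Variant 2 can never score worse than variant 1 on the same data, because the sample mean minimises the squared error, so a baseline computed from the test set itself is slightly optimistic.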

---
## 3.2.1 Linear Regression

Instantiate and fit the model:

In [30]:
lr = LinearRegression()
lr.fit(Z_train, y_train);

Model evaluation:

In [31]:
# Train score
lr.score(Z_train, y_train)

0.9280381467586499

In [32]:
# Test score
lr.score(Z_test, y_test)

0.9262012868805757

In [33]:
cross_val_score(lr, Z_train, y_train).mean()

-49964236746928.98

In [34]:
lr.coef_

array([ 7.14810769e+04, -1.83892772e+04, -2.51267219e+03, ...,
       -1.52437821e+13, -1.21854871e+14, -4.36091067e+13])
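
One plausible reason for coefficients of the order of 1e13 (an assumption about the cause, not something verified on this notebook's data) is perfect collinearity among the dummy variables: `pd.get_dummies` keeps every level of each categorical feature, so the dummy columns of each feature sum to one and duplicate the intercept. The sketch below uses a hypothetical toy frame and a hypothetical helper, `design_rank`, to compare the rank of the design matrix with and without `drop_first=True`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two categorical features
toy = pd.DataFrame({
    "town": ["BISHAN", "YISHUN", "BEDOK", "YISHUN", "BISHAN", "BEDOK"],
    "flat": ["4 ROOM", "5 ROOM", "4 ROOM", "3 ROOM", "3 ROOM", "5 ROOM"],
})

def design_rank(dummies):
    """Return (rank, number of columns) of the design matrix with an intercept."""
    X = np.column_stack([np.ones(len(dummies)), dummies.to_numpy(dtype=float)])
    return int(np.linalg.matrix_rank(X)), X.shape[1]

full = pd.get_dummies(toy)                      # keeps every level
reduced = pd.get_dummies(toy, drop_first=True)  # drops one level per feature

print("all levels:", design_rank(full))
print("drop_first:", design_rank(reduced))
```

When the rank is lower than the number of columns, ordinary least squares has no unique solution and individual coefficients can take arbitrarily large offsetting values, even though the fitted predictions on the training data remain reasonable.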

In [35]:
#Get y_pred for train
y_train_preds = lr.predict(Z_train)

# Check the RMSE on the train set
np.sqrt(metrics.mean_squared_error(y_train, y_train_preds))

38539.188597016415

In [36]:
#Get y_pred for test
y_test_preds = lr.predict(Z_test)

# Check the RMSE on the test set
np.sqrt(metrics.mean_squared_error(y_test, y_test_preds))

38700.81516550692

---
## 3.2.2 Ridge Regression

In [None]:
ridge_final = joblib.load('./joblib/ridge_model.joblib')

In [37]:
# Set up a list of ridge alphas to check.
# np.logspace generates 100 exponents equally spaced between 0 and 5,
# then converts them to alphas between 10^0 and 10^5.
r_alphas = np.logspace(0, 5, 100)

# Cross-validate over our list of ridge alphas.
ridge_cv = RidgeCV(alphas=r_alphas, cv=5).fit(Z_train, y_train)

print(ridge_cv.get_params)

<bound method BaseEstimator.get_params of RidgeCV(alphas=array([1.00000000e+00, 1.12332403e+00, 1.26185688e+00, 1.41747416e+00,
       1.59228279e+00, 1.78864953e+00, 2.00923300e+00, 2.25701972e+00,
       2.53536449e+00, 2.84803587e+00, 3.19926714e+00, 3.59381366e+00,
       4.03701726e+00, 4.53487851e+00, 5.09413801e+00, 5.72236766e+00,
       6.42807312e+00, 7.22080902e+00, 8.11130831e+00, 9.11162756e+00,
       1.02353102e+01, 1.14975700e+0...
       6.89261210e+03, 7.74263683e+03, 8.69749003e+03, 9.77009957e+03,
       1.09749877e+04, 1.23284674e+04, 1.38488637e+04, 1.55567614e+04,
       1.74752840e+04, 1.96304065e+04, 2.20513074e+04, 2.47707636e+04,
       2.78255940e+04, 3.12571585e+04, 3.51119173e+04, 3.94420606e+04,
       4.43062146e+04, 4.97702356e+04, 5.59081018e+04, 6.28029144e+04,
       7.05480231e+04, 7.92482898e+04, 8.90215085e+04, 1.00000000e+05]),
        cv=5)>
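
To make the role of alpha concrete, the sketch below solves ridge regression in closed form on synthetic data and shows how the coefficient vector shrinks as alpha grows over a log-spaced grid like the one above. It is a simplified illustration under stated assumptions: the data is synthetic, `ridge_coefs` is a hypothetical helper, and no intercept is fitted (scikit-learn's `Ridge` fits an unpenalised intercept by default).

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data: 200 rows, 5 features, known coefficients plus noise
n, p = 200, 5
X = rng.normal(size=(n, p))
true_beta = np.array([3.0, -2.0, 0.0, 1.5, 0.5])
y = X @ true_beta + rng.normal(scale=0.5, size=n)

def ridge_coefs(X, y, alpha):
    """Closed-form ridge solution (no intercept): (X'X + alpha*I)^-1 X'y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

alphas = np.logspace(0, 5, 6)  # 1, 10, ..., 100000
norms = [np.linalg.norm(ridge_coefs(X, y, a)) for a in alphas]

for a, nrm in zip(alphas, norms):
    print(f"alpha={a:>8.0f}  ||beta||={nrm:.4f}")
```

Cross-validation, as done by `RidgeCV`, picks the alpha on this path that generalises best to held-out folds.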


In [None]:
#Instantiate Ridge regression
ridge01 = Ridge(alpha=ridge_cv.alpha_)

#crossval score for ridge based on alpha from ridge_cv
ridge_scores = cross_val_score(ridge01,  Z_train, y_train, cv=5)
ridge_scores.mean()

In [39]:
# Evaluate model using R2.
print(ridge_cv.score(Z_train, y_train))
print(ridge_cv.score(Z_test, y_test))

0.9279383262128875
0.9261355632795203


In [40]:
#Get y_pred for train
ridge_y_train_pred = ridge_cv.predict(Z_train)

# Check the RMSE for train
np.sqrt(metrics.mean_squared_error(y_train, ridge_y_train_pred))

38565.90879331237

In [41]:
#Get y_pred for test
ridge_y_pred = ridge_cv.predict(Z_test)

# Check the RMSE for test
np.sqrt(metrics.mean_squared_error(y_test, ridge_y_pred))

38718.04440143296

---
## 3.2.3 Lasso Regression

Keep same train test split as linear regression model.

Lasso alpha-checks and model fitting:

In [42]:
# Set up a list of Lasso alphas to check.
l_alphas = np.logspace(-6, 6, 100)

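# Cross-validate over our list of Lasso alphas.
# Note: tol=1e14 is far looser than scikit-learn's default of 1e-4, so the
# optimiser stops very early; the lasso_cv scores below reflect this setting.
lasso_cv = LassoCV(alphas=l_alphas, cv=5, max_iter=50000, tol=1e14).fit(Z_train, y_train)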

In [43]:
#Instantiate lasso regression
lasso01 = Lasso(alpha=lasso_cv.alpha_)

#crossval score for lasso based on alpha from lasso_cv
lasso_scores = cross_val_score(lasso01,  Z_train, y_train, cv=5)
lasso_scores.mean()

  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(


0.9263294494933151

In [44]:
print(lasso_cv.score(Z_train, y_train))
print(lasso_cv.score(Z_test, y_test))

0.8559382744985202
0.8533538160159182


In [45]:
#Get y_pred for train
lasso_y__train_pred = lasso_cv.predict(Z_train)

# Check the RMSE for train
np.sqrt(metrics.mean_squared_error(y_train, lasso_y__train_pred))

54528.77022939298

In [46]:
#Get y_pred for test
lasso_y_test_pred = lasso_cv.predict(Z_test)

# Check the RMSE for test
np.sqrt(metrics.mean_squared_error(y_test, lasso_y_test_pred))

54554.565542606295

## 3.2.4 Model Exports

In [47]:
joblib.dump(lr, './joblib/lr_model.joblib')

['./joblib/lr_model.joblib']

In [48]:
# Save the fitted ridge model
joblib.dump(ridge_cv, './joblib/ridge_model.joblib')

['./joblib/ridge_model.joblib']

In [49]:
# Save the fitted lasso model
joblib.dump(lasso_cv, './joblib/lasso_model.joblib')

['./joblib/lasso_model.joblib']

---
## 3.2.5 Summary of Model Metrics

|Model|R-square Train score|R-square Test Score|Cross Val Score|Train RMSE|Test RMSE|
|---|---|---|---|---|---|
|**Linear Regression**|0.9280|0.9262|-4.996 × 1e13|38539.1886|38700.8152|
|**Ridge**|0.9279|0.9261|0.9262|38565.9088|38718.0444|
|**Lasso**|0.8559|0.8534|0.9263|54528.7702|54554.5655|

From the model summary, we can conclude that the ridge model is the most well-rounded in terms of cross-val score, R-square score and RMSE, and hence performed the best. The ridge model had near-lowest RMSE on both the train and the test set, with very little difference between the two. Test RMSE is only slightly greater than train RMSE, which indicates little overfitting, and the high R-square on both sets shows no obvious underfitting. On the test data, the ridge model and its independent variables could explain 92.6% of the variance in resale price. For our selected ridge model, the final test RMSE tells us that predictions are typically off by about $38,718.

While the linear regression model obtained slightly lower RMSE scores than ridge, that model was rejected due to its hugely negative cross-val score (about -5.0 × 1e13), meaning the model was unstable across the cross-validation folds. Hence we deem its low RMSE to be unreliable. We also rejected the lasso model due to its convergence warnings and higher train and test RMSE scores.

However, it is important to consider that regularised regression models are harder to interpret than a plain linear regression with no hyperparameters: because the features are standardized and the coefficients are shrunk by the penalty, a coefficient no longer reads directly as the change in resale price per unit of the original feature.
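
The scaling part of this interpretability loss can be undone. For a coefficient fitted on a standardized feature, dividing by that feature's standard deviation gives the effect per original unit. The sketch below demonstrates this with ordinary least squares on a single hypothetical feature (synthetic floor areas and prices, all names assumed for illustration).

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data: price rises by 4,000 per sqm, plus noise
area = rng.uniform(40.0, 160.0, size=500)
price = 50_000.0 + 4_000.0 * area + rng.normal(scale=10_000.0, size=500)

# Standardize the feature, as done before fitting the regularised models
mu, sigma = area.mean(), area.std()
z = (area - mu) / sigma

# np.polyfit with deg=1 returns [slope, intercept]
slope_z, intercept_z = np.polyfit(z, price, deg=1)
slope_raw, intercept_raw = np.polyfit(area, price, deg=1)

# Convert the standardized fit back to original units
slope_back = slope_z / sigma
intercept_back = intercept_z - slope_z * mu / sigma

print(f"per-sqm effect from scaled fit: {slope_back:.2f}")
print(f"per-sqm effect from raw fit:    {slope_raw:.2f}")
```

For ridge and lasso the same conversion applies to the units, but the converted coefficient still includes the shrinkage from the penalty, so it should be read as an approximate rather than an exact marginal effect.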

---
# Conclusion

#### Addressing the Problem Statement

From this project, we arrive at a two-pronged approach to addressing the problem statement and its relevant questions. The EDA covers some basic comparative market analysis (CMA) that gives homeowners a better understanding of the market and of how certain features affect resale price. The model then allows homeowners on the platform to input values for the different variables and receive a predicted resale price. Together, these two approaches help bridge the knowledge gap that homeowners face if they decide to take a "For Sale By Owner" (FSBO) approach to selling their house.

---
#### Findings from our EDA and Model

From the EDA:
We notice that multicollinearity is the biggest cause for concern. Although we have attempted to reduce multicollinearity through feature selection and the use of regularization, its effect cannot be completely eliminated.

Here are some other findings that would be useful for homeowners: 

1. Resale prices<br>
We see mean resale prices falling from mid 2013 to 2015, and then a drastic increase in mid 2020. This is probably due to government cooling measures and COVID-19 lifestyle changes respectively.
2. Distribution of resale prices<br>
The majority of homes are priced on the lower end, giving a right-skewed distribution: a significant number of homes have lower prices, while relatively few have higher prices. The median resale price, around $400,000, represents the middle value of the distribution.

3. Smaller flats of special models could outperform larger flats in terms of resale price.
4. Properties located on higher storeys commanded a higher resale price.
5. Age of property is negatively correlated with resale price.
6. Maximum floor level of the block that a unit is located in is positively correlated with resale price.

<br>
The ridge model had near-lowest RMSE on both the train and the test set, with very little difference between the two. Test RMSE is only slightly greater than train RMSE, which indicates little overfitting, and the high R-square on both sets shows no obvious underfitting. On the test data, the ridge model and its independent variables could explain 92.6% of the variance in resale price. For our selected ridge model, the final test RMSE tells us that predictions are typically off by about $38,718. However, note that this model is less interpretable than the vanilla linear regression.


---
#### Limitations of the Model
The model we trained is effective in providing a first-cut estimate of property resale price. However, it has several limitations:

**1) Limitations of data collected<br>**
The kaggle dataset used for training the model has limitations in terms of its scope and recency. As the data only goes up to April 2021, it does not capture the most recent market trends and developments. Real estate markets are dynamic and can experience significant changes over time, making the predictions less accurate for the current market scenario.

A related point is that post-2021, the world is facing another sociological shift as we resume a "new normal" after COVID-19. This could make it difficult to train the model, as there is no precedent for these new changes that may impact housing.

**2) Limited features considered in the dataset<br>**
The dataset may not encompass all the relevant features that impact property resale prices. Other variables, such as crime rates or unique neighbourhood characteristics (e.g. Yishun is stigmatized as a weird part of Singapore), could significantly influence property values.

The kaggle dataset features encompass the HDB itself, the neighbourhood, and surrounding amenities. However, there are no features documenting characteristics of the current lived-in unit itself. For example, whether walls have been knocked down (so a 5-bedroom flat now only has 3 bedrooms), or whether the unit comes furnished or unfurnished, whether there are unique features within the house (e.g. owner installed a jacuzzi). These features may play quite a key role in influencing the resale price of the house.

**3) Changing Market Dynamics, Government policies<br>**
The housing market is subject to fluctuations and changing dynamics, which can affect the relevance of historical data. The Singapore government often implements policy changes related to the housing market, such as cooling measures or tax reforms. These policy shifts can cause sudden fluctuations in housing prices, making it challenging for regression models to capture and adapt to such changes. The model's ability to predict future prices could be affected if the market experiences unexpected shifts, or if the government implements new policies impacting housing.


In general, housing prices often exhibit complex, non-linear relationships with predictors. Linear models such as the ones used here cannot fully capture such intricate patterns, so it is important to include a disclaimer letting owners know that the model can only act as a guideline.

---
#### Recommendations
Some recommendations include:
* **Data Enhancement<br>** Update the dataset with more recent data to reflect current market conditions. Consider obtaining real-time or periodic data on property transactions and market trends to capture the most up-to-date information.

* **Considering other property types<br>** Since Carousell property caters to all homes in general, further models will have to be built for different categories of properties (e.g., HDB flats, private condominiums, landed properties). Keeping a separate model for each property type allows each model to capture the unique characteristics and price drivers of that property type more accurately.
  
* **Incorporate External Factors<br>** Take into account external economic indicators, interest rates, and government policies in the modeling process. These factors can significantly impact the real estate market, and incorporating them can improve the model's predictions under changing market dynamics.

---

# Kaggle Export

Import cleaned up test set (tidied up and transformed in notebook 4):

In [63]:
test_final = pickle.load(open('./pkl/test_final.pkl', 'rb'))

In [64]:
y_test_pred_final = ridge_cv.predict(test_final)

print(len(y_test_pred_final))

16737


In [79]:
results = pd.DataFrame(y_test_pred_final, columns=['Predicted'])
print(y_test_pred_final.shape)
results.head()

(16737,)


Unnamed: 0,Predicted
0,335079.062923
1,481619.874306
2,342752.696451
3,316513.909156
4,475687.006854


In [80]:
test_forindex = pd.read_csv('./data/test.csv', low_memory=False)
ID = test_forindex[['id']]

ID.head()

Unnamed: 0,id
0,114982
1,95653
2,40303
3,109506
4,100149


In [81]:
subm = pd.concat([ID, results], axis=1)
subm.rename(columns={'id' : 'Id'}, inplace=True)
subm.head()

Unnamed: 0,Id,Predicted
0,114982,335079.062923
1,95653,481619.874306
2,40303,342752.696451
3,109506,316513.909156
4,100149,475687.006854


Export to CSV for submission to kaggle:

In [77]:
import os  # to work with files/directories

# Create the output directory if it does not already exist
os.makedirs('./data/output', exist_ok=True)

# Save the DataFrame to a CSV file
subm.to_csv('./data/output/subm.csv', index=False)
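
Before uploading, it can be worth running a few sanity checks on the submission frame, since `pd.concat(..., axis=1)` aligns on the index and would silently introduce missing values if the two indexes differed. The sketch below is one possible set of checks; `check_submission` is a hypothetical helper, and the `Id` and `Predicted` column names follow the frame built above.

```python
import pandas as pd

def check_submission(subm, expected_rows):
    """Return a list of problems found; an empty list means the frame looks fine."""
    if list(subm.columns) != ["Id", "Predicted"]:
        return ["unexpected columns: " + ", ".join(map(str, subm.columns))]
    problems = []
    if len(subm) != expected_rows:
        problems.append("row count mismatch")
    if subm["Id"].duplicated().any():
        problems.append("duplicate Id values")
    if subm["Predicted"].isna().any():
        problems.append("missing predictions")
    return problems

# Hypothetical example frames
good = pd.DataFrame({"Id": [114982, 95653, 40303],
                     "Predicted": [335079.06, 481619.87, 342752.70]})
bad = pd.DataFrame({"Id": [1, 1, 2],
                    "Predicted": [100000.0, None, 120000.0]})

print(check_submission(good, expected_rows=3))
print(check_submission(bad, expected_rows=3))
```

For the frame in this notebook, the call would be `check_submission(subm, expected_rows=16737)`, matching the number of predictions printed earlier.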