# <font face = 'Palatino Linotype' color = '#274472'> Real Estate Valuation using Machine Learning </font>
### <font face = 'Palatino Linotype' color = '#5885AF'> Data Scientists: Paolo Hilado and Alison Danvers</font>

<font face = 'Palatino Linotype' color = '#5885AF'> Scenario:</font>

<font face = 'Palatino Linotype'> The data scientists were tasked with developing a machine learning model to estimate real estate prices from a set of explanatory variables. The model is based on a historical market dataset of real estate valuations collected from Sindian District, New Taipei City. </font>

<font face = 'Palatino Linotype' color = '#5885AF'> Business Understanding:</font>

<font face = 'Palatino Linotype'> Sindian District is an urbanized district of New Taipei City, Taiwan. In this project, the explanatory variables used to estimate house price are the transaction date, the house age, the distance to the nearest Mass Rapid Transit (MRT) station, the number of convenience stores nearby, and the latitude and longitude of the property.</font>

In [12]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split # used to split the data into train and test sets
import math # used to separate the whole number from the decimal values

In [13]:
df = pd.read_excel("DataSet.xlsx")
df.head()

Unnamed: 0,No,X1 transaction date,X2 house age,X3 distance to the nearest MRT station,X4 number of convenience stores,X5 latitude,X6 longitude,Y house price of unit area
0,1,2012.916667,32.0,84.87882,10,24.98298,121.54024,379000
1,2,2012.916667,19.5,306.5947,9,24.98034,121.53951,422000
2,3,2013.583333,13.3,561.9845,5,24.98746,121.54391,473000
3,4,2013.5,13.3,561.9845,5,24.98746,121.54391,548000
4,5,2012.833333,5.0,390.5684,5,24.97937,121.54245,431000


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 414 entries, 0 to 413
Data columns (total 8 columns):
 #   Column                                  Non-Null Count  Dtype  
---  ------                                  --------------  -----  
 0   No                                      414 non-null    int64  
 1   X1 transaction date                     414 non-null    float64
 2   X2 house age                            414 non-null    float64
 3   X3 distance to the nearest MRT station  414 non-null    float64
 4   X4 number of convenience stores         414 non-null    int64  
 5   X5 latitude                             414 non-null    float64
 6   X6 longitude                            414 non-null    float64
 7   Y house price of unit area              414 non-null    int64  
dtypes: float64(5), int64(3)
memory usage: 26.0 KB


In [4]:
df.eq(' ').any()

No                                        False
X1 transaction date                       False
X2 house age                              False
X3 distance to the nearest MRT station    False
X4 number of convenience stores           False
X5 latitude                               False
X6 longitude                              False
Y house price of unit area                False
dtype: bool

<font face = 'Palatino Linotype' color = '#5885AF'> Data Understanding:</font>

<font face = 'Palatino Linotype'> The dataframe has 8 columns (a record number, 6 explanatory variables, and 1 outcome variable) and 414 observations. The transaction date encodes the year and the month as a single continuous value: the decimal part is the month number (January = 1, February = 2, etc.) divided by the number of months in a year, so that 2013.250 = March 2013 and 2013.500 = June 2013. The house age is measured in years, the distance to the nearest MRT station is measured in meters, the number of convenience stores counts those within walking distance of the property, and the latitude and longitude are the coordinates of the property. There are no missing or blank values in the dataframe.</font>
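
The fractional-year encoding described above can be decoded back into a calendar year and month. The following is a minimal sketch, assuming the month / 12 convention stated in the text. The helper name `decode_transaction_date` is hypothetical, and reading a zero fraction as December of the preceding year is an assumption that follows from that convention, not something the dataset documentation confirms.

```python
import math

def decode_transaction_date(value):
    """Split a fractional year such as 2013.25 into (year, month).

    Assumes the decimal part equals month / 12, as described in the text.
    """
    fraction, whole = math.modf(value)
    year = int(whole)
    month = round(fraction * 12)
    if month == 0:
        # Assumption: 12 / 12 rolls over to the next whole number,
        # so a zero fraction is read as December of the previous year.
        return year - 1, 12
    return year, month

for value in (2013.25, 2013.5, 2012.916667, 2013.583333):
    print(value, decode_transaction_date(value))
```

Rounding is used instead of truncation because the stored values are themselves rounded (for example 2012.916667), so the product with 12 rarely lands exactly on an integer.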

<font face = 'Palatino Linotype' color = '#5885AF'> Data Preparation</font>

In [14]:
# Drop the irrelevant feature for developing the machine learning model.
df = df.drop(['No'], axis = 1)
df.head()

Unnamed: 0,X1 transaction date,X2 house age,X3 distance to the nearest MRT station,X4 number of convenience stores,X5 latitude,X6 longitude,Y house price of unit area
0,2012.916667,32.0,84.87882,10,24.98298,121.54024,379000
1,2012.916667,19.5,306.5947,9,24.98034,121.53951,422000
2,2013.583333,13.3,561.9845,5,24.98746,121.54391,473000
3,2013.5,13.3,561.9845,5,24.98746,121.54391,548000
4,2012.833333,5.0,390.5684,5,24.97937,121.54245,431000


In [15]:
# Provide shorter names for the columns.
df = df.rename(columns = {'X1 transaction date':'t.date', 'X2 house age':'h.age', 
                    'X3 distance to the nearest MRT station':'dist.mrt', 
                    'X4 number of convenience stores':'no.stores',
                    'X5 latitude':'lat', 
                    'X6 longitude':'long', 
                    'Y house price of unit area':'price'})
df.head()

Unnamed: 0,t.date,h.age,dist.mrt,no.stores,lat,long,price
0,2012.916667,32.0,84.87882,10,24.98298,121.54024,379000
1,2012.916667,19.5,306.5947,9,24.98034,121.53951,422000
2,2013.583333,13.3,561.9845,5,24.98746,121.54391,473000
3,2013.5,13.3,561.9845,5,24.98746,121.54391,548000
4,2012.833333,5.0,390.5684,5,24.97937,121.54245,431000


In [16]:
# Split the dataset into train and test sets.
# Given 6 explanatory variables, we would need more than 98 observations for
# training a regression model (50 + 8m rule of thumb with m = 6 predictors;
# Tabachnick and Fidell, 2013). A 70-30 split is used for this project.
train, test = train_test_split(df, test_size=0.30, random_state=0)
print(f'''The number of records for the train set is {len(train)}.
The number of records for the test set is {len(test)}.''')
# Source: Tabachnick, B.G., Fidell, L.S., 2013. Using Multivariate Statistics,
#         6th ed. Pearson Education, Inc., Boston.

The number of records for the train set is 289.
The number of records for the test set is 125.
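
The sample-size reasoning behind the split can be checked with a few lines of arithmetic. This is a small sketch, assuming the 50 + 8m rule of thumb and assuming that the test share is rounded up to a whole record, which is consistent with the 289/125 split printed above. The function name `minimum_sample_size` is hypothetical.

```python
import math

def minimum_sample_size(n_predictors):
    """Rule-of-thumb minimum number of cases for a multiple regression."""
    return 50 + 8 * n_predictors

n_total = 414          # observations in the dataset
test_fraction = 0.30   # share held out for testing

n_test = math.ceil(test_fraction * n_total)  # test share rounded up
n_train = n_total - n_test
required = minimum_sample_size(6)            # 6 explanatory variables

print(f"train={n_train}, test={n_test}, required={required}")
print("Training set is large enough:", n_train > required)
```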


In [17]:
# Separating the explanatory variables from the outcome variable.
x_train = train.drop(['price'], axis = 1)
y_train = train['price']
x_train.head()

Unnamed: 0,t.date,h.age,dist.mrt,no.stores,lat,long
294,2013.5,26.4,335.5273,6,24.9796,121.5414
96,2013.416667,6.4,90.45606,9,24.97433,121.5431
377,2013.333333,3.9,49.66105,8,24.95836,121.53756
89,2013.5,23.0,3947.945,0,24.94783,121.50243
233,2013.333333,39.7,333.3679,9,24.98016,121.53932


In [18]:
# Separating the explanatory variables from the outcome variable.
x_test = test.drop(['price'], axis = 1)
y_test = test['price']
x_test.head()

Unnamed: 0,t.date,h.age,dist.mrt,no.stores,lat,long
356,2012.833333,10.3,211.4473,1,24.97417,121.52999
170,2013.333333,24.0,4527.687,0,24.94741,121.49628
224,2013.333333,34.5,324.9419,6,24.97814,121.5417
331,2013.333333,25.6,4519.69,0,24.94826,121.49587
306,2013.5,14.4,169.9803,1,24.97369,121.52979


In [19]:
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
# Checking for multicollinearity among the continuous variables using the Variance Inflation Factor (VIF).
# Results show no problematic multicollinearity, as the VIF of every predictor
# is less than 5. The VIF of the added constant is not interpreted.
X = sm.add_constant(x_train.iloc[:,0:6]) 
# Calculate VIF for each predictor variable
vif_data = pd.DataFrame()
vif_data["feature"] = X.columns
vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
print(vif_data)

     feature           VIF
0      const  2.361058e+08
1     t.date  1.021891e+00
2      h.age  1.013221e+00
3   dist.mrt  4.281389e+00
4  no.stores  1.660112e+00
5        lat  1.566731e+00
6       long  2.840975e+00
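
For intuition, the VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the others. The sketch below computes it directly with NumPy on synthetic data (not the project dataset). The function name `vif_by_regression` and the generated columns are assumptions made for illustration only.

```python
import numpy as np

def vif_by_regression(X):
    """Return the VIF of each column of X via 1 / (1 - R^2)."""
    X = np.asarray(X, dtype=float)
    vifs = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress column j on the remaining columns plus an intercept.
        design = np.column_stack([np.ones(len(target)), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        r_squared = 1.0 - residuals.var() / target.var()
        vifs.append(1.0 / (1.0 - r_squared))
    return vifs

# Synthetic example: column c is almost a copy of column a.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + 0.1 * rng.normal(size=200)
vifs = vif_by_regression(np.column_stack([a, b, c]))
print([round(v, 2) for v in vifs])
```

In this synthetic case the two nearly collinear columns get a VIF far above 5, while the independent column stays close to 1, which is the pattern the threshold in the cell above is meant to detect.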


In [20]:
# Training a machine learning model for a regression problem using the x_train dataset and the
# outcome variable y_train.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, make_scorer
from sklearn.pipeline import Pipeline

# Defining the pipeline for Ridge Regression.
pipeline = Pipeline([
    ('scaler', StandardScaler()),           
    ('ridge', Ridge())            
    ]
)

# Define hyperparameters to tune
param_grid = {
    'ridge__alpha': [0.01, 0.1, 1.0, 10.0],  # Regularization strength (L2 penalty)
    'ridge__solver': ['auto', 'svd', 'cholesky', 'lsqr', 'sparse_cg', 'sag', 'saga']  # Solver options
}

# Define a custom RMSE scorer
rmse_scorer = make_scorer(lambda y_true, y_pred: np.sqrt(mean_squared_error(y_true, y_pred)),
                           greater_is_better=False)  # smaller RMSE is better

# Perform cross-validation grid search with RMSE scoring
grid_search = GridSearchCV(
    estimator=pipeline,
    param_grid=param_grid,
    cv=5,
    scoring=rmse_scorer,
    n_jobs=-1
)
grid_search.fit(x_train, y_train)

# Get the best hyperparameters
best_params = grid_search.best_params_
print("Best hyperparameters:", best_params)

# Best CV RMSE
best_cv_rmse = -grid_search.best_score_ # negative because greater_is_better=False
print("Best Cross-Validation RMSE:", np.round(best_cv_rmse, 4))

# Get the best model
best_model = grid_search.best_estimator_

# RMSE on training set
y_train_pred = best_model.predict(x_train)
rmse_train = np.sqrt(mean_squared_error(y_train, y_train_pred))
print("Train RMSE:", np.round(rmse_train, 4))

# RMSE on test set
y_test_pred = best_model.predict(x_test)
rmse_test = np.sqrt(mean_squared_error(y_test, y_test_pred))
print("Test RMSE:", np.round(rmse_test, 4))

Best hyperparameters: {'ridge__alpha': 10.0, 'ridge__solver': 'lsqr'}
Best Cross-Validation RMSE: 90495.2006
Train RMSE: 89430.1435
Test RMSE: 84622.3338
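
The sign handling around the custom scorer can be confusing: a grid search always maximizes its score, so an error metric is negated during the search and negated again when reported. The sketch below reproduces that logic in plain Python. The prices and predictions are made-up illustrative numbers, and `negated_rmse` is a hypothetical stand-in for what the scorer does with `greater_is_better=False`.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error of two equally long sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def negated_rmse(y_true, y_pred):
    # Mirrors the sign flip applied when greater_is_better=False.
    return -rmse(y_true, y_pred)

y_true = [400000, 450000, 500000]  # illustrative prices, not project data
candidates = {
    "model_a": [410000, 440000, 520000],
    "model_b": [300000, 450000, 600000],
}

scores = {name: negated_rmse(y_true, pred) for name, pred in candidates.items()}
best = max(scores, key=scores.get)  # maximizing -RMSE minimizes RMSE
print("Best candidate:", best)
print("Reported RMSE:", round(-scores[best], 4))
```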


In [21]:
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error, make_scorer
from sklearn.pipeline import Pipeline

# Defining the pipeline for Lasso Regression.
pipeline = Pipeline([
    ('scaler', StandardScaler()),           
    ('lasso', Lasso())            
    ]
)


# Define hyperparameters to tune
param_grid = {
    'lasso__alpha': [0.01, 0.1, 1.0, 10.0],     # Regularization strength (L1 penalty)
    'lasso__selection': ['cyclic', 'random']    # Order in which coefficients are updated
}

# Define a custom RMSE scorer
rmse_scorer = make_scorer(lambda y_true, y_pred: np.sqrt(mean_squared_error(y_true, y_pred)),
                           greater_is_better=False)  # smaller RMSE is better

# Perform cross-validation grid search
grid_search = GridSearchCV(estimator=pipeline, param_grid=param_grid, cv=5, scoring=rmse_scorer)
grid_search.fit(x_train, y_train)

# Get the best hyperparameters
best_params = grid_search.best_params_
print("Best hyperparameters:", best_params)

# Best CV RMSE
best_cv_rmse = -grid_search.best_score_ # negative because greater_is_better=False
print("Best Cross-Validation RMSE:", np.round(best_cv_rmse, 4))

# Get the best model
best_model = grid_search.best_estimator_

# RMSE on training set
y_train_pred = best_model.predict(x_train)
rmse_train = np.sqrt(mean_squared_error(y_train, y_train_pred))
print("Train RMSE:", np.round(rmse_train, 4))

# RMSE on test set
y_test_pred = best_model.predict(x_test)
rmse_test = np.sqrt(mean_squared_error(y_test, y_test_pred))
print("Test RMSE:", np.round(rmse_test, 4))

Best hyperparameters: {'lasso__alpha': 10.0, 'lasso__selection': 'random'}
Best Cross-Validation RMSE: 90751.0958
Train RMSE: 89372.0387
Test RMSE: 84603.2077


In [22]:
# Performing Elastic Net Regression
# Import necessary libraries
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error, make_scorer
from sklearn.pipeline import Pipeline

# Define the pipeline for Elastic Net
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('elastic', ElasticNet(random_state=42))
    ]
)

# Hyperparameters to tune
param_grid = {
    'elastic__alpha': [0.01, 0.1, 1.0, 10.0],       # regularization strength
    'elastic__l1_ratio': [0.1, 0.5, 0.7, 0.9, 1.0], # balance between L1 (Lasso) and L2 (Ridge)
    'elastic__max_iter': [5000, 10000]             # ensure convergence
}

# Define a custom RMSE scorer
rmse_scorer = make_scorer(lambda y_true, y_pred: np.sqrt(mean_squared_error(y_true, y_pred)),
                           greater_is_better=False)  # smaller RMSE is better

# Perform grid search to find the best hyperparameters
grid_search = GridSearchCV(estimator=pipeline, 
    param_grid=param_grid,
    cv=5,
    scoring=rmse_scorer,
    n_jobs=-1
)

# Fit to training data
grid_search.fit(x_train, y_train)

# Get the best hyperparameters
best_params = grid_search.best_params_
print("Best hyperparameters:", best_params)

# Best CV RMSE (negate because greater_is_better=False)
best_cv_rmse = -grid_search.best_score_
print("Best Cross-Validation RMSE:", np.round(best_cv_rmse, 4))

# Get the best model
best_model = grid_search.best_estimator_


# RMSE on training set
y_train_pred = best_model.predict(x_train)
rmse_train = np.sqrt(mean_squared_error(y_train, y_train_pred))
print("Train RMSE:", np.round(rmse_train, 4))

# RMSE on test set
y_test_pred = best_model.predict(x_test)
rmse_test = np.sqrt(mean_squared_error(y_test, y_test_pred))
print("Test RMSE:", np.round(rmse_test, 4))

Best hyperparameters: {'elastic__alpha': 1.0, 'elastic__l1_ratio': 0.9, 'elastic__max_iter': 5000}
Best Cross-Validation RMSE: 90398.8484
Train RMSE: 89694.8247
Test RMSE: 84852.6134
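
To make the roles of `alpha` and `l1_ratio` concrete, the sketch below evaluates the Elastic Net penalty term for a small coefficient vector, using the form given in the scikit-learn documentation: alpha * l1_ratio * ||w||_1 + 0.5 * alpha * (1 - l1_ratio) * ||w||_2². The function name `elastic_net_penalty` and the coefficient values are hypothetical and chosen only for illustration.

```python
def elastic_net_penalty(coefs, alpha, l1_ratio):
    """Penalty added to the squared-error loss by Elastic Net."""
    l1_norm = sum(abs(w) for w in coefs)
    l2_norm_squared = sum(w * w for w in coefs)
    l1_part = alpha * l1_ratio * l1_norm
    l2_part = 0.5 * alpha * (1.0 - l1_ratio) * l2_norm_squared
    return l1_part + l2_part

coefs = [2.0, -1.0, 0.0]  # illustrative coefficients

for ratio in (0.0, 0.5, 0.9, 1.0):
    penalty = elastic_net_penalty(coefs, alpha=1.0, l1_ratio=ratio)
    print(f"l1_ratio={ratio}: penalty={penalty:.3f}")
```

With `l1_ratio = 1.0` only the L1 (Lasso) term remains, and with `l1_ratio = 0.0` only the L2 (Ridge-style) term remains, so the grid above sweeps between the two behaviors.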


In [23]:
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, make_scorer
from sklearn.pipeline import Pipeline

# Define the pipeline for Random Forest
pipeline = Pipeline([
    ('scaler', StandardScaler()),  # optional for tree-based models, but safe
    ('rf', RandomForestRegressor(random_state=42))
])

# Hyperparameters to tune
param_grid = {
    'rf__n_estimators': [100, 200, 500],         # number of trees
    'rf__max_depth': [None, 5, 10, 20],          # max depth of each tree
    'rf__min_samples_split': [2, 5, 10],         # min samples to split
    'rf__min_samples_leaf': [1, 2, 4]            # min samples per leaf
}

# Define custom RMSE scorer
rmse_scorer = make_scorer(lambda y_true, y_pred: np.sqrt(mean_squared_error(y_true, y_pred)),
                           greater_is_better=False)

# GridSearchCV
grid_search = GridSearchCV(
    estimator=pipeline,
    param_grid=param_grid,
    cv=5,
    scoring=rmse_scorer,
    n_jobs=-1
)

# Fit to training data
grid_search.fit(x_train, y_train)

# Best hyperparameters
best_params = grid_search.best_params_
print("Best hyperparameters:", best_params)

# Best CV RMSE (negate because greater_is_better=False)
best_cv_rmse = -grid_search.best_score_
print("Best Cross-Validation RMSE:", np.round(best_cv_rmse, 4))

# Best model
best_model = grid_search.best_estimator_

# RMSE on training set
y_train_pred = best_model.predict(x_train)
rmse_train = np.sqrt(mean_squared_error(y_train, y_train_pred))
print("Train RMSE:", np.round(rmse_train, 4))

# RMSE on test set
y_test_pred = best_model.predict(x_test)
rmse_test = np.sqrt(mean_squared_error(y_test, y_test_pred))
print("Test RMSE:", np.round(rmse_test, 4))

Best hyperparameters: {'rf__max_depth': 10, 'rf__min_samples_leaf': 1, 'rf__min_samples_split': 2, 'rf__n_estimators': 100}
Best Cross-Validation RMSE: 75207.1488
Train RMSE: 33826.0231
Test RMSE: 75920.2121


In [8]:
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, make_scorer
from sklearn.pipeline import Pipeline
import xgboost as xgb

# Define the pipeline for XGBoost
pipeline = Pipeline([
    ('scaler', StandardScaler()),  # optional; XGBoost is tree-based
    ('xgb', xgb.XGBRegressor(objective='reg:squarederror', random_state=42))
])

# Hyperparameters to tune
param_grid = {
    'xgb__n_estimators': [100, 200, 500],
    'xgb__max_depth': [3, 5, 7],
    'xgb__learning_rate': [0.01, 0.1, 0.2],
    'xgb__subsample': [0.7, 0.8, 1.0],
    'xgb__colsample_bytree': [0.7, 0.8, 1.0]
}

# Define custom RMSE scorer
rmse_scorer = make_scorer(lambda y_true, y_pred: np.sqrt(mean_squared_error(y_true, y_pred)),
                           greater_is_better=False)

# GridSearchCV
grid_search = GridSearchCV(
    estimator=pipeline,
    param_grid=param_grid,
    cv=5,
    scoring=rmse_scorer,
    n_jobs=-1
)

# Fit to training data
grid_search.fit(x_train, y_train)

# Best hyperparameters
best_params = grid_search.best_params_
print("Best hyperparameters:", best_params)

# Best CV RMSE
best_cv_rmse = -grid_search.best_score_
print("Best Cross-Validation RMSE:", np.round(best_cv_rmse, 4))

# Best model
best_model = grid_search.best_estimator_

# RMSE on training set
y_train_pred = best_model.predict(x_train)
rmse_train = np.sqrt(mean_squared_error(y_train, y_train_pred))
print("Train RMSE:", np.round(rmse_train, 4))

# RMSE on test set
y_test_pred = best_model.predict(x_test)
rmse_test = np.sqrt(mean_squared_error(y_test, y_test_pred))
print("Test RMSE:", np.round(rmse_test, 4))

Best hyperparameters: {'xgb__colsample_bytree': 0.7, 'xgb__learning_rate': 0.01, 'xgb__max_depth': 3, 'xgb__n_estimators': 500, 'xgb__subsample': 0.7}
Best Cross-Validation RMSE: 76168.4599
Train RMSE: 46711.18
Test RMSE: 69916.053


In [9]:
# Save a copy of the best model from the preceding cell (the tuned XGBoost pipeline).
import pickle
with open('XGBmodel.pkl', 'wb') as f:
    pickle.dump(best_model, f)
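
A saved model is only useful if it can be read back, so it is worth confirming the round trip. The sketch below shows the save-and-load pattern with context managers. It uses a plain dictionary named `stand_in_model` as a hypothetical placeholder for the fitted pipeline and writes to a temporary directory, so it runs without the trained model. In practice the same `pickle.load` call would be applied to the saved model file, ideally with the same library versions that were used for training.

```python
import os
import pickle
import tempfile

# Hypothetical placeholder for the fitted pipeline object.
stand_in_model = {"name": "placeholder", "params": {"max_depth": 3}}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pkl")

    # Save: the context manager closes the file even if dump() fails.
    with open(path, "wb") as f:
        pickle.dump(stand_in_model, f)

    # Load: restore the object from disk.
    with open(path, "rb") as f:
        restored = pickle.load(f)

print("Round trip preserved the object:", restored == stand_in_model)
```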

### Decision:

Among the machine learning models, Random Forest and extreme gradient boosting (XGBoost) perform best, with clearly lower RMSE than the linear models. Although Random Forest has a slightly better CV RMSE (75,207 vs 76,168), its test RMSE is worse (75,920 vs 69,916). The large gap between its train and test RMSE (33,826 vs 75,920) also indicates possible overfitting. XGBoost has a more balanced train/test error (46,711 vs 69,916), which usually means it is more reliable on new data. XGBoost is therefore the model we will use for deployment.
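
The comparison can be laid out side by side using the metrics printed by the cells above. The sketch below copies those values into a dictionary and computes the test-to-train RMSE ratio as a rough overfitting indicator. The ratio is an illustrative choice made for this sketch and was not part of the original analysis.

```python
# RMSE values copied from the printed outputs of the training cells.
results = {
    "Ridge":         {"cv": 90495.2006, "train": 89430.1435, "test": 84622.3338},
    "Lasso":         {"cv": 90751.0958, "train": 89372.0387, "test": 84603.2077},
    "Elastic Net":   {"cv": 90398.8484, "train": 89694.8247, "test": 84852.6134},
    "Random Forest": {"cv": 75207.1488, "train": 33826.0231, "test": 75920.2121},
    "XGBoost":       {"cv": 76168.4599, "train": 46711.1800, "test": 69916.0530},
}

# Test-to-train ratio: values far above 1 suggest overfitting.
ratios = {name: r["test"] / r["train"] for name, r in results.items()}

for name, r in results.items():
    print(f"{name:<14} CV={r['cv']:>10.1f} test={r['test']:>10.1f} "
          f"test/train={ratios[name]:.2f}")

best_cv = min(results, key=lambda name: results[name]["cv"])
best_test = min(results, key=lambda name: results[name]["test"])
print("Lowest CV RMSE:", best_cv)
print("Lowest test RMSE:", best_test)
```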