# <font face = 'Palatino Linotype' color = '#274472'> Real Estate Valuation using Machine Learning <font/>
### <font face = 'Palatino Linotype' color = '#5885AF'> Data Scientists: Paolo Hilado and Alison Danvers<font/>

<font face = 'Palatino Linotype' color = '#5885AF'> Scenario:<font/>
   
<font face = 'Palatino Linotype'> Data scientists were tasked with developing a machine learning model to estimate real estate prices from a set of explanatory variables. The model is based on a historical market dataset of real estate valuations collected from Sindian Dist., New Taipei City. <font/>

<font face = 'Palatino Linotype' color = '#5885AF'> Business Understanding:<font/>
   
<font face = 'Palatino Linotype'> Sindian District is an urbanized district of New Taipei City, Taiwan. In this project, the explanatory variables used to estimate house price are the transaction date, house age, distance to the nearest Mass Rapid Transit (MRT) station, number of nearby convenience stores, and the latitude and longitude of the property.<font/>

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split # splits the data into train and test sets
import math # math.modf can separate the whole number from the decimal part of a value

In [2]:
df = pd.read_excel("DataSet.xlsx")
df.head()

Unnamed: 0,No,X1 transaction date,X2 house age,X3 distance to the nearest MRT station,X4 number of convenience stores,X5 latitude,X6 longitude,Y house price of unit area
0,1,2012.916667,32.0,84.87882,10,24.98298,121.54024,37.9
1,2,2012.916667,19.5,306.5947,9,24.98034,121.53951,42.2
2,3,2013.583333,13.3,561.9845,5,24.98746,121.54391,47.3
3,4,2013.5,13.3,561.9845,5,24.98746,121.54391,54.8
4,5,2012.833333,5.0,390.5684,5,24.97937,121.54245,43.1


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 414 entries, 0 to 413
Data columns (total 8 columns):
 #   Column                                  Non-Null Count  Dtype  
---  ------                                  --------------  -----  
 0   No                                      414 non-null    int64  
 1   X1 transaction date                     414 non-null    float64
 2   X2 house age                            414 non-null    float64
 3   X3 distance to the nearest MRT station  414 non-null    float64
 4   X4 number of convenience stores         414 non-null    int64  
 5   X5 latitude                             414 non-null    float64
 6   X6 longitude                            414 non-null    float64
 7   Y house price of unit area              414 non-null    float64
dtypes: float64(6), int64(2)
memory usage: 26.0 KB


In [4]:
df.eq(' ').any()

No                                        False
X1 transaction date                       False
X2 house age                              False
X3 distance to the nearest MRT station    False
X4 number of convenience stores           False
X5 latitude                               False
X6 longitude                              False
Y house price of unit area                False
dtype: bool

<font face = 'Palatino Linotype' color = '#5885AF'> Data Understanding:<font/>
   
<font face = 'Palatino Linotype'> The dataframe has 8 columns (a record number, 6 explanatory variables, and 1 outcome variable) and 414 observations. The transaction date encodes the year and month as a single continuous value: the decimal part is the month number (January = 1, February = 2, etc.) divided by 12, so that 2013.250 = March 2013 and 2013.500 = June 2013. House age is measured in years, distance to the nearest MRT station is measured in meters, the number of convenience stores counts those within walking distance of the property, and the latitude and longitude give the property's coordinates. There are no missing values (every column has 414 non-null entries) and no cells that contain only a blank string.<font/>
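
The date encoding described above can be reversed with `math.modf`, which splits a float into its fractional and whole parts. This is a minimal sketch, assuming the decimal part is the month number divided by 12 as stated above. The function name `decode_transaction_date` is illustrative and not part of the notebook. A decimal part of zero is returned as month 0, because the description does not say how that case maps to a calendar month.

```python
import math

def decode_transaction_date(t_date):
    """Split an encoded transaction date into (year, month number).

    Assumes the decimal part equals month / 12, e.g. 2013.250 -> (2013, 3).
    """
    frac, year = math.modf(t_date)   # fractional part, whole part
    month = round(frac * 12)         # undo the division by 12
    return int(year), month

for value in (2013.250, 2013.500, 2012.916667):
    print(value, decode_transaction_date(value))
```

Rounding is needed because values such as 2012.916667 are stored with limited precision, so the product with 12 is only approximately an integer.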

<font face = 'Palatino Linotype' color = '#5885AF'> Data Preparation<font/>

In [5]:
# Drop the irrelevant feature for developing the machine learning model.
df = df.drop(['No'], axis = 1)
df.head()

Unnamed: 0,X1 transaction date,X2 house age,X3 distance to the nearest MRT station,X4 number of convenience stores,X5 latitude,X6 longitude,Y house price of unit area
0,2012.916667,32.0,84.87882,10,24.98298,121.54024,37.9
1,2012.916667,19.5,306.5947,9,24.98034,121.53951,42.2
2,2013.583333,13.3,561.9845,5,24.98746,121.54391,47.3
3,2013.5,13.3,561.9845,5,24.98746,121.54391,54.8
4,2012.833333,5.0,390.5684,5,24.97937,121.54245,43.1


In [6]:
# Provide shorter names for the columns.
df = df.rename(columns = {'X1 transaction date':'t.date', 'X2 house age':'h.age', 
                    'X3 distance to the nearest MRT station':'dist.mrt', 
                    'X4 number of convenience stores':'no.stores',
                    'X5 latitude':'lat', 
                    'X6 longitude':'long', 
                    'Y house price of unit area':'price'})
df.head()

Unnamed: 0,t.date,h.age,dist.mrt,no.stores,lat,long,price
0,2012.916667,32.0,84.87882,10,24.98298,121.54024,37.9
1,2012.916667,19.5,306.5947,9,24.98034,121.53951,42.2
2,2013.583333,13.3,561.9845,5,24.98746,121.54391,47.3
3,2013.5,13.3,561.9845,5,24.98746,121.54391,54.8
4,2012.833333,5.0,390.5684,5,24.97937,121.54245,43.1


In [7]:
# Split the dataset into train and test sets.
# Given 6 explanatory variables, we would need > 98 observations for
# training a regression model (Tabachnick and Fidell, 2013). The 70-30 split
# used for this project leaves enough training records to meet that requirement.
train, test = train_test_split(df, test_size=0.30, random_state=0)
print(f'''The number of records for the train set is {len(train)}.
The number of records for the test set is {len(test)}.''')
# Source: Tabachnick, B.G., Fidell, L.S., 2013. Using Multivariate Statistics,
#         6th ed. Pearson Education, Inc., Boston. 

The number of records for the train set is 289.
The number of records for the test set is 125.
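
The sample-size check mentioned in the cell above can be written out explicitly. This is a minimal sketch, assuming the threshold of 98 comes from the rule of thumb N >= 50 + 8m for m predictors. The formula and the name `min_sample_size` are assumptions for illustration; the notebook itself only cites the resulting number.

```python
def min_sample_size(n_predictors):
    """Rule-of-thumb minimum sample size, assuming N >= 50 + 8m."""
    return 50 + 8 * n_predictors

n_train = 289                    # train-set size printed above
required = min_sample_size(6)    # 6 explanatory variables
print(required, n_train >= required)
```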


In [8]:
# Separating the explanatory variables from the outcome variable.
x_train = train.drop(['price'], axis = 1)
y_train = train['price']
x_train.head()

Unnamed: 0,t.date,h.age,dist.mrt,no.stores,lat,long
294,2013.5,26.4,335.5273,6,24.9796,121.5414
96,2013.416667,6.4,90.45606,9,24.97433,121.5431
377,2013.333333,3.9,49.66105,8,24.95836,121.53756
89,2013.5,23.0,3947.945,0,24.94783,121.50243
233,2013.333333,39.7,333.3679,9,24.98016,121.53932


In [9]:
# Separating the explanatory variables from the outcome variable.
x_test = test.drop(['price'], axis = 1)
y_test = test['price']
x_test.head()

Unnamed: 0,t.date,h.age,dist.mrt,no.stores,lat,long
356,2012.833333,10.3,211.4473,1,24.97417,121.52999
170,2013.333333,24.0,4527.687,0,24.94741,121.49628
224,2013.333333,34.5,324.9419,6,24.97814,121.5417
331,2013.333333,25.6,4519.69,0,24.94826,121.49587
306,2013.5,14.4,169.9803,1,24.97369,121.52979


In [15]:
# Standardize all the continuous variables.
from sklearn.preprocessing import StandardScaler

# Assigning feature labels to variable continuous_vars.
continuous_vars = ['t.date','h.age', 'dist.mrt', 'no.stores','lat','long']

# Initialize StandardScaler.
scaler = StandardScaler()

# Fit scaler to the continuous variables and transform them.
x_train[continuous_vars] = scaler.fit_transform(x_train[continuous_vars])

In [16]:
# Standardize the test set using the scaler fitted on the train set.
# The scaler is only applied here (transform), not refitted, so no
# test-set statistics leak into the preprocessing.
continuous_vars = ['t.date','h.age', 'dist.mrt', 'no.stores','lat','long']
x_test[continuous_vars] = scaler.transform(x_test[continuous_vars])
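
The two scaling cells follow a fit-on-train, apply-to-test pattern. The sketch below reproduces that pattern by hand on toy numbers (the values are illustrative and not taken from the dataset), assuming the usual standardization formula (x - mean) / standard deviation with the population standard deviation.

```python
import numpy as np

# Toy stand-ins for one feature column (illustrative values only).
train_col = np.array([10.0, 20.0, 30.0, 40.0])
test_col = np.array([25.0, 50.0])

# "Fit": learn the mean and population standard deviation from the train set only.
mu = train_col.mean()
sigma = train_col.std()  # ddof=0

# "Transform": apply the same train-set statistics to both splits.
train_scaled = (train_col - mu) / sigma
test_scaled = (test_col - mu) / sigma

print(train_scaled.mean(), train_scaled.std())
print(test_scaled)
```

The scaled train column has mean 0 and standard deviation 1 by construction, while the scaled test column generally does not, which is expected when the statistics come from the train set.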

In [17]:
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
# Check for multicollinearity among the explanatory variables using the Variance Inflation Factor (VIF).
# All VIF values are below 5, so there is no sign of problematic multicollinearity.
X = sm.add_constant(x_train.iloc[:,0:6]) 
# Calculate VIF for each predictor variable
vif_data = pd.DataFrame()
vif_data["feature"] = X.columns
vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
print(vif_data)

     feature       VIF
0      const  1.000000
1     t.date  1.021891
2      h.age  1.013221
3   dist.mrt  4.281389
4  no.stores  1.660112
5        lat  1.566731
6       long  2.840975
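
For intuition about the numbers above, the VIF of a predictor is 1 / (1 - R^2), where R^2 comes from regressing that predictor on all the others. The sketch below computes it by hand with NumPy on synthetic data; the helper name `vif` and the demo columns are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def vif(X, j):
    """VIF of column j: 1 / (1 - R^2) from regressing column j on the other columns."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])   # add an intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)     # ordinary least squares
    resid = y - A @ coef
    r2 = 1.0 - resid.var() / y.var()
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)                 # independent of a
c = a + 0.1 * rng.normal(size=500)       # nearly a copy of a
X_demo = np.column_stack([a, b, c])

vifs = [vif(X_demo, j) for j in range(X_demo.shape[1])]
print(np.round(vifs, 2))
```

The independent column gets a VIF close to 1, while the two nearly identical columns get very large values, which is the pattern the threshold of 5 is meant to catch.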


In [18]:
# Train a Ridge regression model on x_train with y_train as the outcome variable.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
import warnings
warnings.filterwarnings('ignore') # note: this also hides solver convergence warnings

# Features are in x_train and the target variable is in y_train.

# Define Ridge regression model
ridge = Ridge()

# Define hyperparameters to tune
param_grid = {
    'alpha': [0.01, 0.1, 1.0, 10.0],  # Regularization strength (L2 penalty)
    'solver': ['auto', 'svd', 'cholesky', 'lsqr', 'sparse_cg', 'sag', 'saga']  # Solver options
}
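# Perform cross-validation grid search. No scoring is given, so candidates
# are ranked by the estimator's default score (R^2 for regressors).
grid_search = GridSearchCV(estimator=ridge, param_grid=param_grid, cv=5) # cv=5 for 5-fold cross-validation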
grid_search.fit(x_train, y_train)

# Get the best hyperparameters
best_params = grid_search.best_params_
print("Best hyperparameters:", best_params)

# Get the best model
best_model = grid_search.best_estimator_

# Evaluate the best model on the train set using RMSE
y_train_pred = best_model.predict(x_train)
rmse_train = np.sqrt(mean_squared_error(y_train, y_train_pred))  # RMSE on train set; avoids the version-dependent `squared` argument
print("Root Mean Squared Error on train set:", np.round(rmse_train,2))

Best hyperparameters: {'alpha': 10.0, 'solver': 'sag'}
Root Mean Squared Error on train set: 8.94


In [19]:
# Checking model performance on the train set using the mean absolute percentage error.
# Note: the formula below averages the per-record absolute percentage errors, which is
# the plain (unweighted) MAPE, although it is reported under the WMAPE label.

# Actual values (y_true) and predicted values (y_pred) for the train set
y_true = y_train.to_numpy()
y_pred = np.asarray(y_train_pred)

# Compute the absolute percentage errors
absolute_percentage_errors = np.abs((y_true - y_pred) / y_true)

# Average the absolute percentage errors and express the result as a percentage
mape = (np.sum(absolute_percentage_errors) / len(y_train)) * 100

print("Weighted Mean Absolute Percentage Error (WMAPE):", np.round(mape,2))
# Performance is poor, so other models are tried next.

Weighted Mean Absolute Percentage Error (WMAPE): 18.11
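
The metric cell above divides the sum of per-record percentage errors by the number of records. That is the usual definition of the plain MAPE, whereas a weighted MAPE is commonly defined as the total absolute error divided by the total of the actual values. The sketch below contrasts the two on made-up numbers; the function names and values are illustrative, and the WMAPE definition used here is an assumption, since the notebook does not define it.

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error: average of |error| / |actual|, in percent."""
    ratios = [abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)]
    return 100 * sum(ratios) / len(ratios)

def wmape(y_true, y_pred):
    """Weighted MAPE: total absolute error over total absolute actuals, in percent."""
    total_error = sum(abs(t - p) for t, p in zip(y_true, y_pred))
    return 100 * total_error / sum(abs(t) for t in y_true)

# Illustrative values only (not taken from the dataset).
actual = [10.0, 100.0]
predicted = [15.0, 110.0]
print(mape(actual, predicted), wmape(actual, predicted))
```

The two metrics differ whenever the relative errors vary with the size of the actual value: MAPE treats every record equally, while WMAPE gives larger actual values more influence.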


In [20]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Lasso
import warnings

# Ignore all warnings
warnings.filterwarnings("ignore")

# Features are in x_train and the target variable is in y_train.

# Define the Lasso regression model
lasso = Lasso()

# Define hyperparameters to tune
param_grid = {
    'alpha': [0.01, 0.1, 1.0, 10.0]  # Regularization strength
}

# Perform cross-validation grid search
grid_search = GridSearchCV(estimator=lasso, param_grid=param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(x_train, y_train)

# Get the best hyperparameters
best_params = grid_search.best_params_
print("Best hyperparameters:", best_params)

# Get the best model
best_model = grid_search.best_estimator_

# Evaluate the best model on the train set using RMSE
y_train_pred = best_model.predict(x_train)
rmse_train = np.sqrt(mean_squared_error(y_train, y_train_pred))  # RMSE on train set; avoids the version-dependent `squared` argument
print("Root Mean Squared Error on train set:", np.round(rmse_train,2))

Best hyperparameters: {'alpha': 0.1}
Root Mean Squared Error on train set: 8.94


In [21]:
# Checking model performance on the train set using the mean absolute percentage error.
# Note: the formula below averages the per-record absolute percentage errors, which is
# the plain (unweighted) MAPE, although it is reported under the WMAPE label.

# Actual values (y_true) and predicted values (y_pred) for the train set
y_true = y_train.to_numpy()
y_pred = np.asarray(y_train_pred)

# Compute the absolute percentage errors
absolute_percentage_errors = np.abs((y_true - y_pred) / y_true)

# Average the absolute percentage errors and express the result as a percentage
mape = (np.sum(absolute_percentage_errors) / len(y_train)) * 100

print("Weighted Mean Absolute Percentage Error (WMAPE):", np.round(mape,2))
# Performance is poor, so other models are tried next.

Weighted Mean Absolute Percentage Error (WMAPE): 18.13


In [22]:
# Performing Elastic Net Regression
# Import necessary libraries
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV

# Define the hyperparameter grid for Elastic Net.
# Note: max_iter values this small may stop the solver before it converges;
# any resulting warnings are hidden by the warnings filter set in an earlier cell.
parametersGrid = {
    "max_iter": [1, 5, 10],
    "alpha": [0.0001, 0.001, 0.01, 0.1, 1, 10, 100],
    "l1_ratio": np.arange(0.0, 1.0, 0.1)
}

# Initialize the Elastic Net model
eNet = ElasticNet()

# Perform grid search to find the best hyperparameters
grid_search  = GridSearchCV(eNet, parametersGrid, scoring='neg_mean_squared_error', cv=5)
grid_search.fit(x_train, y_train)

# Get the best hyperparameters
best_params = grid_search.best_params_
print("Best hyperparameters:", best_params)

# Get the best model
best_model = grid_search.best_estimator_

# Evaluate the best model on the train set using RMSE
y_train_pred = best_model.predict(x_train)
rmse_train = np.sqrt(mean_squared_error(y_train, y_train_pred))  # RMSE on train set; avoids the version-dependent `squared` argument
print("Root Mean Squared Error on train set:", np.round(rmse_train,2))

Best hyperparameters: {'alpha': 0.1, 'l1_ratio': 0.2, 'max_iter': 5}
Root Mean Squared Error on train set: 8.95


In [23]:
# Checking model performance on the train set using the mean absolute percentage error.
# Note: the formula below averages the per-record absolute percentage errors, which is
# the plain (unweighted) MAPE, although it is reported under the WMAPE label.

# Actual values (y_true) and predicted values (y_pred) for the train set
y_true = y_train.to_numpy()
y_pred = np.asarray(y_train_pred)

# Compute the absolute percentage errors
absolute_percentage_errors = np.abs((y_true - y_pred) / y_true)

# Average the absolute percentage errors and express the result as a percentage
mape = (np.sum(absolute_percentage_errors) / len(y_train)) * 100

print("Weighted Mean Absolute Percentage Error (WMAPE):", np.round(mape,2))
# Performance is poor, so other models are tried next.

Weighted Mean Absolute Percentage Error (WMAPE): 18.12


In [24]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor

# Define Random Forest regressor (no random_state is set, so results can vary between runs)
rf_regressor = RandomForestRegressor()

# Define hyperparameters grid
param_grid = {
    'n_estimators': [10, 50, 100],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Define GridSearchCV with 5-fold cross-validation
grid_search = GridSearchCV(estimator=rf_regressor, param_grid=param_grid, cv=5)

# Perform GridSearchCV
grid_search.fit(x_train, y_train)

# Print best hyperparameters
print("Best Hyperparameters:", grid_search.best_params_)
# Get the best model
best_model = grid_search.best_estimator_
# Evaluate the best model on the train set using RMSE
y_train_pred = best_model.predict(x_train)
rmse_train = np.sqrt(mean_squared_error(y_train, y_train_pred))  # RMSE on train set; avoids the version-dependent `squared` argument
print("Root Mean Squared Error on train set:", np.round(rmse_train,2))

Best Hyperparameters: {'max_depth': 20, 'min_samples_leaf': 4, 'min_samples_split': 2, 'n_estimators': 50}
Root Mean Squared Error on train set: 5.4


In [25]:
# Checking model performance on the train set using the mean absolute percentage error.
# Note: the formula below averages the per-record absolute percentage errors, which is
# the plain (unweighted) MAPE, although it is reported under the WMAPE label.

# Actual values (y_true) and predicted values (y_pred) for the train set
y_true = y_train.to_numpy()
y_pred = np.asarray(y_train_pred)

# Compute the absolute percentage errors
absolute_percentage_errors = np.abs((y_true - y_pred) / y_true)

# Average the absolute percentage errors and express the result as a percentage
mape = (np.sum(absolute_percentage_errors) / len(y_train)) * 100

print("Weighted Mean Absolute Percentage Error (WMAPE):", np.round(mape,2))
# Train-set performance improved; the test set is checked next.

Weighted Mean Absolute Percentage Error (WMAPE): 9.07


In [26]:
# Evaluate the best model on the test set using RMSE
y_test_pred = best_model.predict(x_test)
rmse_test = np.sqrt(mean_squared_error(y_test, y_test_pred))  # RMSE on test set
print("Root Mean Squared Error on test set:", np.round(rmse_test,2))

Root Mean Squared Error on test set: 7.07


In [27]:
# Checking model performance on the test set using the mean absolute percentage error.
# Note: the formula below averages the per-record absolute percentage errors, which is
# the plain (unweighted) MAPE, although it is reported under the WMAPE label.

# Actual values (y_true) and predicted values (y_pred) for the test set
y_true = y_test.to_numpy()
y_pred = np.asarray(y_test_pred)

# Compute the absolute percentage errors
absolute_percentage_errors = np.abs((y_true - y_pred) / y_true)

# Average the absolute percentage errors and express the result as a percentage
mape = (np.sum(absolute_percentage_errors) / len(y_test)) * 100

print("Weighted Mean Absolute Percentage Error (WMAPE):", np.round(mape,2))
# The test-set error is noticeably higher than the train-set error, so another model is tried next.

Weighted Mean Absolute Percentage Error (WMAPE): 14.88


In [28]:
import xgboost as xgb
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error
import warnings
warnings.filterwarnings('ignore')

# Define the XGBoost regressor
xgb_regressor = xgb.XGBRegressor()

# Define hyperparameters to tune
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 4, 5],
    'learning_rate': [0.01, 0.1, 0.3],
    'gamma': [0, 0.1, 0.2],
    'subsample': [0.8, 0.9, 1.0],
    'colsample_bytree': [0.8, 0.9, 1.0],
    'reg_alpha': [0, 0.1, 0.5],
    'reg_lambda': [0, 0.1, 0.5]
}

# Perform cross-validation grid search
grid_search = GridSearchCV(estimator=xgb_regressor, param_grid=param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(x_train, y_train)

# Get the best hyperparameters
best_params = grid_search.best_params_
print("Best hyperparameters:", best_params)

# Get the best model
best_model = grid_search.best_estimator_

# Evaluate the best model on the train set using RMSE
y_train_pred = best_model.predict(x_train)
rmse_train = np.sqrt(mean_squared_error(y_train, y_train_pred))  # RMSE on train set; avoids the version-dependent `squared` argument
print("Root Mean Squared Error on train set:", np.round(rmse_train,2))

Best hyperparameters: {'colsample_bytree': 0.8, 'gamma': 0.2, 'learning_rate': 0.01, 'max_depth': 4, 'n_estimators': 300, 'reg_alpha': 0, 'reg_lambda': 0.5, 'subsample': 0.9}
Root Mean Squared Error on train set: 4.33
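
The XGBoost grid search above is by far the most expensive step in the notebook, and its cost follows directly from the grid. The sketch below counts the candidate combinations and cross-validation fits for the same grid, using only the standard library.

```python
from math import prod

# The same grid as in the XGBoost cell: 8 hyperparameters with 3 values each.
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 4, 5],
    'learning_rate': [0.01, 0.1, 0.3],
    'gamma': [0, 0.1, 0.2],
    'subsample': [0.8, 0.9, 1.0],
    'colsample_bytree': [0.8, 0.9, 1.0],
    'reg_alpha': [0, 0.1, 0.5],
    'reg_lambda': [0, 0.1, 0.5],
}
cv_folds = 5

n_candidates = prod(len(values) for values in param_grid.values())
n_fits = n_candidates * cv_folds
print(n_candidates, n_fits)
```

With three values for each of eight hyperparameters, the count is 3^8 candidates times 5 folds, plus one final refit of the best candidate on the full train set under the default `refit=True` setting of `GridSearchCV`.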


In [29]:
# Evaluate the best model on the test set using RMSE
y_test_pred = best_model.predict(x_test)
rmse_test = np.sqrt(mean_squared_error(y_test, y_test_pred))  # RMSE on test set
print("Root Mean Squared Error on test set:", np.round(rmse_test,2))

Root Mean Squared Error on test set: 7.08


In [30]:
# Checking model performance using the mean absolute percentage error.
# Note: y_train and y_train_pred are used here, so the value printed below is a
# train-set figure. The formula averages the per-record absolute percentage errors,
# which is the plain (unweighted) MAPE, although it is reported under the WMAPE label.

# Actual values (y_true) and predicted values (y_pred) for the train set
y_true = y_train.to_numpy()
y_pred = np.asarray(y_train_pred)

# Compute the absolute percentage errors
absolute_percentage_errors = np.abs((y_true - y_pred) / y_true)

# Average the absolute percentage errors and express the result as a percentage
mape = (np.sum(absolute_percentage_errors) / len(y_train)) * 100

print("Weighted Mean Absolute Percentage Error (WMAPE):", np.round(mape,2))

Weighted Mean Absolute Percentage Error (WMAPE): 9.15


In [32]:
# Save a copy of the XGBoost model (best_model holds the XGBoost estimator at this point).
import pickle
with open('XGBmodel.pkl', 'wb') as f:
    pickle.dump(best_model, f)
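
Because the model was trained on standardized inputs, anyone loading the pickle later also needs the fitted scaler to prepare new records. One option is to pickle the two together. The sketch below shows the idea with hypothetical stand-ins: plain dictionaries replace the notebook's `scaler` and `best_model`, a toy linear model replaces XGBoost, and an in-memory buffer replaces the `.pkl` file, so that the example runs without any dependencies.

```python
import io
import pickle

# Hypothetical stand-ins: in the notebook these would be the fitted `scaler`
# and `best_model`; plain dicts keep the sketch dependency-free.
scaler_params = {"mean": [2013.0, 20.0], "scale": [0.5, 10.0]}
model_params = {"coef": [1.0, -2.0], "intercept": 40.0}

def predict(bundle, row):
    """Standardize a raw row with the saved scaler, then apply the saved model."""
    sc, m = bundle["scaler"], bundle["model"]
    z = [(x - mu) / s for x, mu, s in zip(row, sc["mean"], sc["scale"])]
    return m["intercept"] + sum(c * v for c, v in zip(m["coef"], z))

# Save both pieces together; an in-memory buffer stands in for a .pkl file.
buffer = io.BytesIO()
pickle.dump({"scaler": scaler_params, "model": model_params}, buffer)

# Later: load the bundle and predict from raw (unscaled) inputs.
buffer.seek(0)
bundle = pickle.load(buffer)
print(predict(bundle, [2013.5, 10.0]))
```

Keeping preprocessing and model in one artifact avoids the failure mode where raw, unscaled values are passed to a model that expects standardized ones.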

### Decision:

The XGBoost model has almost the same test-set RMSE (7.08) as the Random Forest Regressor (7.07). Its WMAPE of 9.15 was computed on the train set (cell In [30] uses `y_train`), where the Random Forest scored 9.07. The Random Forest's test-set WMAPE was 14.88, and no test-set WMAPE is reported for XGBoost, so the two models cannot yet be compared like-for-like on that metric. The XGBoost model is the one preferred and saved for future use in estimating house prices from the identified explanatory variables: transaction date, house age, distance to the nearest MRT station, number of nearby convenience stores, latitude, and longitude.