# <font face = 'Palatino Linotype' color = '#274472'> Real Estate Valuation using Machine Learning <font/>
### <font face = 'Palatino Linotype' color = '#5885AF'> Data Scientists: Paolo Hilado and Alison Danvers<font/>

<font face = 'Palatino Linotype' color = '#5885AF'> Scenario:<font/>
   
<font face = 'Palatino Linotype'> Data scientists were tasked with developing a machine learning model to estimate real estate prices from the provided explanatory variables. The model is based on a historical market dataset of real estate valuations collected from Sindian District, New Taipei City. <font/>

<font face = 'Palatino Linotype' color = '#5885AF'> Business Understanding:<font/>
   
<font face = 'Palatino Linotype'> Sindian District is an urbanized district of New Taipei City, Taiwan. In this project, the explanatory variables considered for estimating the house price are the transaction date, the house age, the distance to the nearest Mass Rapid Transit (MRT) station, the number of nearby convenience stores, and the latitude and longitude of the property.<font/>

In [12]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split # used to split the data into train and test sets
import math # used to separate the whole number from the decimal values

In [2]:
df = pd.read_excel("DataSet.xlsx")
df.head()

Unnamed: 0,No,X1 transaction date,X2 house age,X3 distance to the nearest MRT station,X4 number of convenience stores,X5 latitude,X6 longitude,Y house price of unit area
0,1,2012.916667,32.0,84.87882,10,24.98298,121.54024,37.9
1,2,2012.916667,19.5,306.5947,9,24.98034,121.53951,42.2
2,3,2013.583333,13.3,561.9845,5,24.98746,121.54391,47.3
3,4,2013.5,13.3,561.9845,5,24.98746,121.54391,54.8
4,5,2012.833333,5.0,390.5684,5,24.97937,121.54245,43.1


In [45]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 414 entries, 0 to 413
Data columns (total 8 columns):
 #   Column                                  Non-Null Count  Dtype  
---  ------                                  --------------  -----  
 0   No                                      414 non-null    int64  
 1   X1 transaction date                     414 non-null    float64
 2   X2 house age                            414 non-null    float64
 3   X3 distance to the nearest MRT station  414 non-null    float64
 4   X4 number of convenience stores         414 non-null    int64  
 5   X5 latitude                             414 non-null    float64
 6   X6 longitude                            414 non-null    float64
 7   Y house price of unit area              414 non-null    float64
dtypes: float64(6), int64(2)
memory usage: 26.0 KB


In [4]:
df.eq(' ').any()

No                                        False
X1 transaction date                       False
X2 house age                              False
X3 distance to the nearest MRT station    False
X4 number of convenience stores           False
X5 latitude                               False
X6 longitude                              False
Y house price of unit area                False
dtype: bool

<font face = 'Palatino Linotype' color = '#5885AF'> Data Understanding:<font/>
   
<font face = 'Palatino Linotype'> The dataframe has 8 columns (a record number, 6 explanatory variables, and 1 outcome variable) and 414 observations. The transaction date encodes the year and month as a single number: the decimal part is the month number (January = 1, February = 2, etc.) divided by 12, the number of months in a year. It is therefore stored as a continuous variable, such that 2013.250 = March 2013 and 2013.500 = June 2013. House age is measured in years, distance to the nearest MRT station is measured in meters, the number of convenience stores counts those within walking distance of the property, and latitude and longitude give the coordinates of the property. The outputs above also show that the dataframe has no missing values and no cells containing only a blank space.<font/>
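The date encoding described above can be reversed with a little arithmetic. The sketch below assumes the month-divided-by-12 convention stated in the text; `decode_transaction_date` and `MONTH_LABELS` are hypothetical names introduced only for illustration and are not part of the notebook.

```python
# Hypothetical helper: decode a fractional-year transaction date into a month label.
MONTH_LABELS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
                "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

def decode_transaction_date(value: float) -> str:
    """Return a label such as 'Mar-2013' for a value such as 2013.25."""
    year = int(value)                    # whole-number part is the year
    month = round((value - year) * 12)   # decimal part times 12 gives the month number
    if month == 0:                       # x.000 stands for December of the previous year
        year -= 1
        month = 12
    return f"{MONTH_LABELS[month - 1]}-{year}"

for raw in (2013.25, 2013.5, 2012.916667, 2013.0):
    print(raw, "->", decode_transaction_date(raw))
```

Rounding after multiplying by 12 absorbs the truncated decimals (for example 2012.916667), which is why the helper does not rely on exact float equality.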

<font face = 'Palatino Linotype' color = '#5885AF'> Data Preparation<font/>

In [3]:
# Drop the irrelevant feature for developing the machine learning model.
df = df.drop(['No'], axis = 1)
df.head()

Unnamed: 0,X1 transaction date,X2 house age,X3 distance to the nearest MRT station,X4 number of convenience stores,X5 latitude,X6 longitude,Y house price of unit area
0,2012.916667,32.0,84.87882,10,24.98298,121.54024,37.9
1,2012.916667,19.5,306.5947,9,24.98034,121.53951,42.2
2,2013.583333,13.3,561.9845,5,24.98746,121.54391,47.3
3,2013.5,13.3,561.9845,5,24.98746,121.54391,54.8
4,2012.833333,5.0,390.5684,5,24.97937,121.54245,43.1


In [4]:
# Provide shorter names for the columns.
df = df.rename(columns = {'X1 transaction date':'t.date', 'X2 house age':'h.age', 
                    'X3 distance to the nearest MRT station':'dist.mrt', 
                    'X4 number of convenience stores':'no.stores',
                    'X5 latitude':'lat', 'X6 longitude':'long', 
                    'Y house price of unit area':'price'})
df.head()

Unnamed: 0,t.date,h.age,dist.mrt,no.stores,lat,long,price
0,2012.916667,32.0,84.87882,10,24.98298,121.54024,37.9
1,2012.916667,19.5,306.5947,9,24.98034,121.53951,42.2
2,2013.583333,13.3,561.9845,5,24.98746,121.54391,47.3
3,2013.5,13.3,561.9845,5,24.98746,121.54391,54.8
4,2012.833333,5.0,390.5684,5,24.97937,121.54245,43.1


In [5]:
# Split the dataset into train and test sets.
# Given 6 explanatory variables, we would need > 98 observations for
# training a regression model (Tabachnick and Fidell, 2013). A 70-30 split
# is used for this project.
train, test = train_test_split(df, test_size=0.30, random_state=0)
print(f'''The number of records for the train set is {len(train)}.
The number of records for the test set is {len(test)}.''')
# Source: Tabachnick, B.G.,Fidell, L.S., 2013. Using Multivariate Statistics, 
#         6th ed. Pearson Education, Inc., Boston. 

The number of records for the train set is 289.
The number of records for the test set is 125.
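The sample-size check and the split sizes can be reproduced by hand. This sketch assumes the cited threshold comes from the rule of thumb N >= 50 + 8m (m = number of explanatory variables), and that the test share of a fractional split is rounded up; `min_sample_size` is a hypothetical helper name.

```python
import math

def min_sample_size(n_predictors: int) -> int:
    """Assumed rule of thumb for testing the overall regression: N >= 50 + 8m."""
    return 50 + 8 * n_predictors

n_total = 414        # observations in the dataset
test_size = 0.30     # share reserved for the test set

n_test = math.ceil(n_total * test_size)   # test share rounded up
n_train = n_total - n_test                # remainder goes to the train set

print("Minimum observations for 6 predictors:", min_sample_size(6))
print("Train records:", n_train, "| Test records:", n_test)
print("Train set large enough:", n_train >= min_sample_size(6))
```

Under these assumptions the train set comfortably exceeds the threshold, which supports the choice of a 70-30 split.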


In [49]:
train.head()

Unnamed: 0,t.date,h.age,dist.mrt,no.stores,lat,long,price
294,2013.5,26.4,335.5273,6,24.9796,121.5414,38.1
96,2013.416667,6.4,90.45606,9,24.97433,121.5431,59.5
377,2013.333333,3.9,49.66105,8,24.95836,121.53756,56.8
89,2013.5,23.0,3947.945,0,24.94783,121.50243,25.3
233,2013.333333,39.7,333.3679,9,24.98016,121.53932,32.4


In [6]:
# Recode the transaction date so that the year and month are treated as a categorical variable
# instead of a continuous variable.
replace_values = {2013.0833333:'Jan-2013',
                    2013.1666667: 'Feb-2013', 2013.25: 'Mar-2013',
                    2013.3333333: 'Apr-2013', 2013.4166667: 'May-2013',
                    2013.5: 'Jun-2013', 2013.5833333: 'Jul-2013',
                    2012.6666667: 'Aug-2012', 2012.75: 'Sept-2012',
                    2012.8333333: 'Oct-2012', 2012.9166667: 'Nov-2012',
                    2013.0: 'Dec-2012'}   
train = train.replace({"t.date": replace_values})   

In [7]:
# Do a simple check.
train[train['t.date'] == 'Dec-2012']

Unnamed: 0,t.date,h.age,dist.mrt,no.stores,lat,long,price
66,Dec-2012,1.0,193.5845,6,24.96571,121.54089,50.7
350,Dec-2012,13.2,492.2313,5,24.96515,121.53737,42.3
196,Dec-2012,22.8,707.9067,2,24.981,121.54713,36.6
409,Dec-2012,13.7,4082.015,0,24.94155,121.50381,15.4
116,Dec-2012,30.9,6396.283,1,24.94375,121.47883,12.2
343,Dec-2012,33.5,563.2854,8,24.98223,121.53597,46.6
407,Dec-2012,5.2,2408.993,0,24.95505,121.55964,22.3
287,Dec-2012,19.2,461.1016,5,24.95425,121.5399,32.9
341,Dec-2012,13.0,750.0704,2,24.97371,121.54951,37.0
204,Dec-2012,18.0,1414.837,1,24.95182,121.54887,26.6


In [11]:
train.head()

Unnamed: 0,t.date,h.age,dist.mrt,no.stores,lat,long,price
294,Jun-2013,26.4,335.5273,6,24.9796,121.5414,38.1
96,May-2013,6.4,90.45606,9,24.97433,121.5431,59.5
377,Apr-2013,3.9,49.66105,8,24.95836,121.53756,56.8
89,Jun-2013,23.0,3947.945,0,24.94783,121.50243,25.3
233,Apr-2013,39.7,333.3679,9,24.98016,121.53932,32.4


In [8]:
# Create a dataset with the categorical explanatory variable (t.date) one-hot encoded.
train_1 = pd.get_dummies(train, columns = ['t.date'], dtype=int)
train_1.head()

Unnamed: 0,h.age,dist.mrt,no.stores,lat,long,price,t.date_Apr-2013,t.date_Aug-2012,t.date_Dec-2012,t.date_Feb-2013,t.date_Jan-2013,t.date_Jul-2013,t.date_Jun-2013,t.date_Mar-2013,t.date_May-2013,t.date_Nov-2012,t.date_Oct-2012,t.date_Sept-2012
294,26.4,335.5273,6,24.9796,121.5414,38.1,0,0,0,0,0,0,1,0,0,0,0,0
96,6.4,90.45606,9,24.97433,121.5431,59.5,0,0,0,0,0,0,0,0,1,0,0,0
377,3.9,49.66105,8,24.95836,121.53756,56.8,1,0,0,0,0,0,0,0,0,0,0,0
89,23.0,3947.945,0,24.94783,121.50243,25.3,0,0,0,0,0,0,1,0,0,0,0,0
233,39.7,333.3679,9,24.98016,121.53932,32.4,1,0,0,0,0,0,0,0,0,0,0,0


In [9]:
# Separating the explanatory variables from the outcome variable.
x_train = train_1.drop(['price'], axis = 1)
y_train = train_1['price']
x_train.head()

Unnamed: 0,h.age,dist.mrt,no.stores,lat,long,t.date_Apr-2013,t.date_Aug-2012,t.date_Dec-2012,t.date_Feb-2013,t.date_Jan-2013,t.date_Jul-2013,t.date_Jun-2013,t.date_Mar-2013,t.date_May-2013,t.date_Nov-2012,t.date_Oct-2012,t.date_Sept-2012
294,26.4,335.5273,6,24.9796,121.5414,0,0,0,0,0,0,1,0,0,0,0,0
96,6.4,90.45606,9,24.97433,121.5431,0,0,0,0,0,0,0,0,1,0,0,0
377,3.9,49.66105,8,24.95836,121.53756,1,0,0,0,0,0,0,0,0,0,0,0
89,23.0,3947.945,0,24.94783,121.50243,0,0,0,0,0,0,1,0,0,0,0,0
233,39.7,333.3679,9,24.98016,121.53932,1,0,0,0,0,0,0,0,0,0,0,0


In [10]:
# Standardize all the continuous variables.
from sklearn.preprocessing import StandardScaler

# Assigning feature labels to variable continuous_vars.
continuous_vars = ['h.age', 'dist.mrt', 'no.stores','lat','long']

# Initialize StandardScaler.
scaler = StandardScaler()

# Fit scaler to the continuous variables and transform them.
x_train[continuous_vars] = scaler.fit_transform(x_train[continuous_vars])
x_train.head(5)

Unnamed: 0,h.age,dist.mrt,no.stores,lat,long,t.date_Apr-2013,t.date_Aug-2012,t.date_Dec-2012,t.date_Feb-2013,t.date_Jan-2013,t.date_Jul-2013,t.date_Jun-2013,t.date_Mar-2013,t.date_May-2013,t.date_Nov-2012,t.date_Oct-2012,t.date_Sept-2012
294,0.799165,-0.607432,0.671889,0.822989,0.542723,0,0,0,0,0,0,1,0,0,0,0,0
96,-0.968263,-0.800623,1.701089,0.411624,0.65402,0,0,0,0,0,0,0,0,1,0,0,0
377,-1.189192,-0.832782,1.358023,-0.83496,0.291322,1,0,0,0,0,0,0,0,0,0,0,0
89,0.498702,2.24026,-1.386513,-1.656909,-2.008608,0,0,0,0,0,0,1,0,0,0,0,0
233,1.974505,-0.609134,1.701089,0.866702,0.406547,1,0,0,0,0,0,0,0,0,0,0,0
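For reference, standardization replaces each value with its z-score, z = (x - mean) / std. The sketch below applies that formula with NumPy to the five `h.age` values shown in the train preview; because the mean and standard deviation are taken from those five rows only, the numbers will not match the notebook output, which uses all 289 train records.

```python
import numpy as np

# First five h.age values from the train set preview (illustration only).
h_age = np.array([26.4, 6.4, 3.9, 23.0, 39.7])

# z-score with the population standard deviation (ddof=0),
# which is the convention StandardScaler follows.
z = (h_age - h_age.mean()) / h_age.std(ddof=0)

print("z-scores:", np.round(z, 3))
print("mean:", round(float(z.mean()), 6), "| std:", round(float(z.std(ddof=0)), 6))
```

After the transformation the column has mean 0 and standard deviation 1, so features measured in years, meters, and degrees become comparable in scale.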


In [11]:
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
# Check for multicollinearity among the continuous variables using the Variance Inflation Factor (VIF).
# The results show no problematic multicollinearity, as the VIF of every continuous variable
# is below 5.
X = sm.add_constant(x_train.iloc[:,0:5]) 
# Calculate VIF for each predictor variable
vif_data = pd.DataFrame()
vif_data["feature"] = X.columns
vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

print(vif_data)

     feature       VIF
0      const  1.000000
1      h.age  1.011737
2   dist.mrt  4.233181
3  no.stores  1.656648
4        lat  1.549397
5       long  2.840182
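To make the VIF values easier to interpret: the VIF of a feature is 1 / (1 - R²), where R² comes from regressing that feature on all the other features. The sketch below computes it directly with NumPy on synthetic data; the function name `vif` and the demo columns are assumptions for illustration, not the statsmodels implementation.

```python
import numpy as np

def vif(X: np.ndarray) -> np.ndarray:
    """Variance inflation factor of each column of X (intercept added internally)."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n), others])      # intercept + other features
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out

# Synthetic demo: column c is almost a copy of column a, column b is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)
values = vif(np.column_stack([a, b, c]))
print(np.round(values, 2))
```

In the demo, the two nearly collinear columns receive large VIFs while the independent column stays close to 1, which is the pattern the below-5 threshold is meant to catch.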


In [15]:
# Train a Ridge regression model on the explanatory variables (x_train) and the
# outcome variable (y_train), tuning its hyperparameters with a grid search.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

# Define Ridge regression model
ridge = Ridge()

# Define hyperparameters to tune
param_grid = {
    'alpha': [0.01, 0.1, 1.0, 10.0],  # Regularization strength (L2 penalty)
    'solver': ['auto', 'svd', 'cholesky', 'lsqr', 'sparse_cg', 'sag', 'saga']  # Solver options
}
# Perform cross-validation grid search
grid_search = GridSearchCV(estimator=ridge, param_grid=param_grid, cv=5) # cv=5 for 5-fold cross-validation
grid_search.fit(x_train, y_train)

# Get the best hyperparameters
best_params = grid_search.best_params_
print("Best hyperparameters:", best_params)

# Get the best model
best_model = grid_search.best_estimator_

# Evaluate the best model on the train set using RMSE
y_train_pred = best_model.predict(x_train)
rmse_train = np.sqrt(mean_squared_error(y_train, y_train_pred))  # RMSE on train set (np.sqrt works across scikit-learn versions)
print("Root Mean Squared Error on train set:", np.round(rmse_train,2))

Best hyperparameters: {'alpha': 10.0, 'solver': 'sparse_cg'}
Root Mean Squared Error on train set: 8.84
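To illustrate what the `alpha` hyperparameter does, the sketch below solves the ridge problem in closed form, w = (XᵀX + αI)⁻¹Xᵀy, on centered synthetic data. It is a minimal illustration under stated assumptions (an unpenalized intercept, made-up data, and the hypothetical helper `ridge_fit`), not how scikit-learn's solvers are implemented.

```python
import numpy as np

def ridge_fit(X: np.ndarray, y: np.ndarray, alpha: float):
    """Closed-form ridge solution with an unpenalized intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean                  # centering removes the intercept
    penalty = alpha * np.eye(X.shape[1])             # L2 penalty term
    coef = np.linalg.solve(Xc.T @ Xc + penalty, Xc.T @ yc)
    intercept = y_mean - x_mean @ coef
    return coef, intercept

# Synthetic data with known coefficients (illustration only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=100)

for alpha in (0.01, 10.0, 1000.0):
    coef, intercept = ridge_fit(X_demo, y_demo, alpha)
    print(alpha, np.round(coef, 3), round(float(np.linalg.norm(coef)), 3))
```

Larger values of `alpha` shrink the coefficient vector toward zero, trading a little bias for lower variance.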


In [37]:
# Check model performance on the train set using the mean absolute percentage error (MAPE).
# The per-record errors are averaged without weights, so the value is the MAPE rather than a weighted MAPE.

# Actual values (y_true) and predicted values (y_pred) for the train set.
y_true = np.asarray(y_train)
y_pred = np.asarray(y_train_pred)

# Compute the absolute percentage errors.
absolute_percentage_errors = np.abs((y_true - y_pred) / y_true)

# Average the absolute percentage errors and express the result as a percentage.
mape = (np.sum(absolute_percentage_errors) / len(y_train)) * 100

print("Mean Absolute Percentage Error (MAPE) on train set:", np.round(mape,2))
# Try other models because the error is still high.

Mean Absolute Percentage Error (MAPE) on train set: 17.87
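The metric cell above divides the sum of the per-record absolute percentage errors by the number of records, which is the plain MAPE. A weighted variant (WMAPE) is commonly defined as the total absolute error divided by the total actual value. The sketch below contrasts the two on made-up numbers; the function names are hypothetical.

```python
def mape(actual, predicted):
    """Mean of |actual - predicted| / |actual|, as a percentage."""
    errors = [abs(a - p) / abs(a) for a, p in zip(actual, predicted)]
    return 100 * sum(errors) / len(errors)

def wmape(actual, predicted):
    """Sum of absolute errors divided by sum of absolute actual values, as a percentage."""
    total_error = sum(abs(a - p) for a, p in zip(actual, predicted))
    total_actual = sum(abs(a) for a in actual)
    return 100 * total_error / total_actual

# Made-up prices per unit area (illustration only).
actual = [10.0, 20.0, 40.0]
predicted = [12.0, 18.0, 40.0]

print("MAPE :", round(mape(actual, predicted), 2))
print("WMAPE:", round(wmape(actual, predicted), 2))
```

MAPE treats every record equally, so errors on low-priced properties weigh heavily, whereas WMAPE weights each record by its actual value.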


In [24]:
# Recode the transaction date in the test set so that the year and month are treated as a
# categorical variable instead of a continuous variable.
replace_values = {2013.0833333:'Jan-2013',
                    2013.1666667: 'Feb-2013', 2013.25: 'Mar-2013',
                    2013.3333333: 'Apr-2013', 2013.4166667: 'May-2013',
                    2013.5: 'Jun-2013', 2013.5833333: 'Jul-2013',
                    2012.6666667: 'Aug-2012', 2012.75: 'Sept-2012',
                    2012.8333333: 'Oct-2012', 2012.9166667: 'Nov-2012',
                    2013.0: 'Dec-2012'}   
test = test.replace({"t.date": replace_values}) 

In [25]:
# One-hot encode the categorical explanatory variable (t.date) in the test data.
test_1 = pd.get_dummies(test, columns = ['t.date'], dtype=int)
test_1.head()

Unnamed: 0,h.age,dist.mrt,no.stores,lat,long,price,t.date_Apr-2013,t.date_Aug-2012,t.date_Dec-2012,t.date_Feb-2013,t.date_Jan-2013,t.date_Jul-2013,t.date_Jun-2013,t.date_Mar-2013,t.date_May-2013,t.date_Nov-2012,t.date_Oct-2012,t.date_Sept-2012
356,10.3,211.4473,1,24.97417,121.52999,45.3,0,0,0,0,0,0,0,0,0,0,1,0
170,24.0,4527.687,0,24.94741,121.49628,14.4,1,0,0,0,0,0,0,0,0,0,0,0
224,34.5,324.9419,6,24.97814,121.5417,46.0,1,0,0,0,0,0,0,0,0,0,0,0
331,25.6,4519.69,0,24.94826,121.49587,15.6,1,0,0,0,0,0,0,0,0,0,0,0
306,14.4,169.9803,1,24.97369,121.52979,50.2,0,0,0,0,0,0,1,0,0,0,0,0
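Encoding the train and test sets separately only yields matching columns when every month occurs in both sets, as it does here. A hedged sketch of one way to guard against a missing category is to reindex the test dummies to the train columns; the tiny frames below are made up for illustration.

```python
import pandas as pd

# Made-up miniature train and test sets; the test set lacks two of the months.
train_demo = pd.DataFrame({"h.age": [26.4, 6.4, 3.9],
                           "t.date": ["Jun-2013", "May-2013", "Apr-2013"]})
test_demo = pd.DataFrame({"h.age": [10.3, 24.0],
                          "t.date": ["Apr-2013", "Apr-2013"]})

train_dummies = pd.get_dummies(train_demo, columns=["t.date"], dtype=int)
test_dummies = pd.get_dummies(test_demo, columns=["t.date"], dtype=int)

# Align the test columns to the train columns; absent months become all-zero columns.
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(test_aligned)
```

The aligned frame has exactly the columns, in the same order, that a model fitted on the train dummies expects.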


In [26]:
# Separating the explanatory variables from the outcome variable.
x_test = test_1.drop(['price'], axis = 1)
y_test = test_1['price']
x_test.head()

Unnamed: 0,h.age,dist.mrt,no.stores,lat,long,t.date_Apr-2013,t.date_Aug-2012,t.date_Dec-2012,t.date_Feb-2013,t.date_Jan-2013,t.date_Jul-2013,t.date_Jun-2013,t.date_Mar-2013,t.date_May-2013,t.date_Nov-2012,t.date_Oct-2012,t.date_Sept-2012
356,10.3,211.4473,1,24.97417,121.52999,0,0,0,0,0,0,0,0,0,0,1,0
170,24.0,4527.687,0,24.94741,121.49628,1,0,0,0,0,0,0,0,0,0,0,0
224,34.5,324.9419,6,24.97814,121.5417,1,0,0,0,0,0,0,0,0,0,0,0
331,25.6,4519.69,0,24.94826,121.49587,1,0,0,0,0,0,0,0,0,0,0,0
306,14.4,169.9803,1,24.97369,121.52979,0,0,0,0,0,0,1,0,0,0,0,0


In [27]:
# Standardize all the continuous variables.
from sklearn.preprocessing import StandardScaler

# Names of the continuous variables to standardize.
continuous_vars = ['h.age', 'dist.mrt', 'no.stores','lat','long']

# Initialize StandardScaler.
scaler = StandardScaler()

# Fit the scaler to the continuous variables of the test set and transform them.
# Caution: this re-estimates the mean and standard deviation from the test set, so the
# test features are not on the same scale as the train features the models were fitted on.
x_test[continuous_vars] = scaler.fit_transform(x_test[continuous_vars])
x_test.head(5)

Unnamed: 0,h.age,dist.mrt,no.stores,lat,long,t.date_Apr-2013,t.date_Aug-2012,t.date_Dec-2012,t.date_Feb-2013,t.date_Jan-2013,t.date_Jul-2013,t.date_Jun-2013,t.date_Mar-2013,t.date_May-2013,t.date_Nov-2012,t.date_Oct-2012,t.date_Sept-2012
356,-0.717318,-0.661945,-1.07192,0.457247,-0.255934,0,0,0,0,0,0,0,0,0,0,1,0
170,0.476005,2.81757,-1.405228,-1.895163,-2.439542,1,0,0,0,0,0,0,0,0,0,0,0
224,1.390596,-0.570452,0.594622,0.80624,0.502597,1,0,0,0,0,0,0,0,0,0,0,0
331,0.615371,2.811123,-1.405228,-1.820441,-2.4661,1,0,0,0,0,0,0,0,0,0,0,0
306,-0.360192,-0.695373,-1.07192,0.415051,-0.268889,0,0,0,0,0,0,1,0,0,0,0,0
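The cell above calls `fit_transform` on the test features, which estimates a fresh mean and standard deviation from the test set. A common alternative is to fit the scaler on the train set only and reuse it on the test set. The sketch below shows that pattern on a few rows copied from the previews above; the variable names are assumptions.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# A few (h.age, dist.mrt) rows from the train and test previews, for illustration only.
train_values = np.array([[26.4, 335.5273], [6.4, 90.45606], [3.9, 49.66105],
                         [23.0, 3947.945], [39.7, 333.3679]])
test_values = np.array([[10.3, 211.4473], [24.0, 4527.687]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train_values)   # learn mean and std from the train set
test_scaled = scaler.transform(test_values)         # reuse the train statistics on the test set

print(np.round(test_scaled, 3))
```

With this pattern, a given raw value maps to the same standardized value in both sets, so the fitted coefficients apply consistently at prediction time.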


In [38]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Lasso
import warnings

# Ignore all warnings.
warnings.filterwarnings("ignore")

# Define the Lasso regression model
lasso = Lasso()

# Define hyperparameters to tune
param_grid = {
    'alpha': [0.01, 0.1, 1.0, 10.0]  # Regularization strength
}

# Perform cross-validation grid search
grid_search = GridSearchCV(estimator=lasso, param_grid=param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(x_train, y_train)

# Get the best hyperparameters
best_params = grid_search.best_params_
print("Best hyperparameters:", best_params)

# Get the best model
best_model = grid_search.best_estimator_

# Evaluate the best model on the train set using RMSE
y_train_pred = best_model.predict(x_train)
rmse_train = np.sqrt(mean_squared_error(y_train, y_train_pred))  # RMSE on train set (np.sqrt works across scikit-learn versions)
print("Root Mean Squared Error on train set:", np.round(rmse_train,2))

Best hyperparameters: {'alpha': 0.1}
Root Mean Squared Error on train set: 8.87
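Because this grid search is scored with `neg_mean_squared_error`, its `best_score_` attribute already holds a cross-validated error estimate that can be reported next to the train RMSE. The sketch below shows the conversion on synthetic data generated with `make_regression`; the data and variable names are assumptions and do not reproduce the notebook's numbers.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

# Synthetic regression data (illustration only).
X_demo, y_demo = make_regression(n_samples=200, n_features=6, noise=10.0, random_state=0)

search = GridSearchCV(estimator=Lasso(),
                      param_grid={"alpha": [0.01, 0.1, 1.0, 10.0]},
                      cv=5, scoring="neg_mean_squared_error")
search.fit(X_demo, y_demo)

# best_score_ is the mean negative MSE across the folds, so negate it before the square root.
cv_rmse = np.sqrt(-search.best_score_)
print("Best hyperparameters:", search.best_params_)
print("Cross-validated RMSE:", np.round(cv_rmse, 2))
```

The cross-validated figure is computed on folds the model did not see during fitting, so it is usually a more cautious estimate than the RMSE on the full train set.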


In [39]:
# Check model performance on the train set using the mean absolute percentage error (MAPE).
# The per-record errors are averaged without weights, so the value is the MAPE rather than a weighted MAPE.

# Actual values (y_true) and predicted values (y_pred) for the train set.
y_true = np.asarray(y_train)
y_pred = np.asarray(y_train_pred)

# Compute the absolute percentage errors.
absolute_percentage_errors = np.abs((y_true - y_pred) / y_true)

# Average the absolute percentage errors and express the result as a percentage.
mape = (np.sum(absolute_percentage_errors) / len(y_train)) * 100

print("Mean Absolute Percentage Error (MAPE) on train set:", np.round(mape,2))
# Try other models because the error is still high.

Mean Absolute Percentage Error (MAPE) on train set: 18.05


In [41]:
# Performing Elastic Net Regression
# Import necessary libraries
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV

# The train set (x_train, y_train) prepared above is reused here.

# Define the hyperparameter grid for Elastic Net
parametersGrid = {
    "max_iter": [1, 5, 10],
    "alpha": [0.0001, 0.001, 0.01, 0.1, 1, 10, 100],
    "l1_ratio": np.arange(0.0, 1.0, 0.1)
}

# Initialize the Elastic Net model
eNet = ElasticNet()

# Perform grid search to find the best hyperparameters
grid_search  = GridSearchCV(eNet, parametersGrid, scoring='neg_mean_squared_error', cv=5)
grid_search.fit(x_train, y_train)

# Get the best hyperparameters
best_params = grid_search.best_params_
print("Best hyperparameters:", best_params)

# Get the best model
best_model = grid_search.best_estimator_

# Evaluate the best model on the train set using RMSE
y_train_pred = best_model.predict(x_train)
rmse_train = np.sqrt(mean_squared_error(y_train, y_train_pred))  # RMSE on train set (np.sqrt works across scikit-learn versions)
print("Root Mean Squared Error on train set:", np.round(rmse_train,2))

Best hyperparameters: {'alpha': 0.1, 'l1_ratio': 0.0, 'max_iter': 5}
Root Mean Squared Error on train set: 8.9
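Two details of the Elastic Net grid are worth spelling out. First, `np.arange(0.0, 1.0, 0.1)` stops before 1.0, so a pure L1 penalty is never tried. Second, `l1_ratio` mixes the L1 and L2 penalties in the objective that scikit-learn documents for `ElasticNet`. The sketch below counts the grid candidates and writes that objective out in NumPy; `elastic_net_objective` is a hypothetical helper and ignores the intercept.

```python
import itertools
import numpy as np

max_iters = [1, 5, 10]
alphas = [0.0001, 0.001, 0.01, 0.1, 1, 10, 100]
l1_ratios = np.arange(0.0, 1.0, 0.1)   # 0.0, 0.1, ..., 0.9 (the stop value is excluded)

n_candidates = len(list(itertools.product(max_iters, alphas, l1_ratios)))
print("Candidates per fold:", n_candidates)
print("Largest l1_ratio tried:", round(float(l1_ratios.max()), 1))

def elastic_net_objective(X, y, coef, alpha, l1_ratio):
    """Squared-error loss plus the mixed L1/L2 penalty (intercept omitted)."""
    n = len(y)
    residual = y - X @ coef
    loss = (residual @ residual) / (2 * n)
    l1_term = alpha * l1_ratio * np.abs(coef).sum()
    l2_term = 0.5 * alpha * (1 - l1_ratio) * (coef @ coef)
    return loss + l1_term + l2_term
```

With `l1_ratio = 0.0`, the value selected in the run above, the L1 term vanishes and only the L2 penalty remains. The `max_iter` values in the grid are also far below scikit-learn's default of 1000, so the solver may stop before it converges.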


In [42]:
# Check model performance on the train set using the mean absolute percentage error (MAPE).
# The per-record errors are averaged without weights, so the value is the MAPE rather than a weighted MAPE.

# Actual values (y_true) and predicted values (y_pred) for the train set.
y_true = np.asarray(y_train)
y_pred = np.asarray(y_train_pred)

# Compute the absolute percentage errors.
absolute_percentage_errors = np.abs((y_true - y_pred) / y_true)

# Average the absolute percentage errors and express the result as a percentage.
mape = (np.sum(absolute_percentage_errors) / len(y_train)) * 100

print("Mean Absolute Percentage Error (MAPE) on train set:", np.round(mape,2))
# Try a different treatment of the features because the error is still high.

Mean Absolute Percentage Error (MAPE) on train set: 18.02


# Another go... but this time consider t.date as a continuous variable.

In [50]:
train.head()

Unnamed: 0,t.date,h.age,dist.mrt,no.stores,lat,long,price
294,2013.5,26.4,335.5273,6,24.9796,121.5414,38.1
96,2013.416667,6.4,90.45606,9,24.97433,121.5431,59.5
377,2013.333333,3.9,49.66105,8,24.95836,121.53756,56.8
89,2013.5,23.0,3947.945,0,24.94783,121.50243,25.3
233,2013.333333,39.7,333.3679,9,24.98016,121.53932,32.4


In [51]:
# Separating the explanatory variables from the outcome variable.
x_train = train.drop(['price'], axis = 1)
y_train = train['price']
x_train.head()

Unnamed: 0,t.date,h.age,dist.mrt,no.stores,lat,long
294,2013.5,26.4,335.5273,6,24.9796,121.5414
96,2013.416667,6.4,90.45606,9,24.97433,121.5431
377,2013.333333,3.9,49.66105,8,24.95836,121.53756
89,2013.5,23.0,3947.945,0,24.94783,121.50243
233,2013.333333,39.7,333.3679,9,24.98016,121.53932


In [52]:
# Standardize all the continuous variables.
from sklearn.preprocessing import StandardScaler

# Names of the continuous variables to standardize (t.date is now included).
continuous_vars = ['t.date', 'h.age', 'dist.mrt', 'no.stores','lat','long']

# Initialize StandardScaler
scaler = StandardScaler()

# Fit scaler to the continuous variables and transform them
x_train[continuous_vars] = scaler.fit_transform(x_train[continuous_vars])
x_train.head(5)

Unnamed: 0,t.date,h.age,dist.mrt,no.stores,lat,long
294,1.242834,0.799165,-0.607432,0.671889,0.822989,0.542723
96,0.947214,-0.968263,-0.800623,1.701089,0.411624,0.65402
377,0.651593,-1.189192,-0.832782,1.358023,-0.83496,0.291322
89,1.242834,0.498702,2.24026,-1.386513,-1.656909,-2.008608
233,0.651593,1.974505,-0.609134,1.701089,0.866702,0.406547


In [55]:
from statsmodels.tools.tools import add_constant
X = add_constant(x_train.iloc[:,0:6]) 
X

Unnamed: 0,const,t.date,h.age,dist.mrt,no.stores,lat,long
294,1.0,1.242834,0.799165,-0.607432,0.671889,0.822989,0.542723
96,1.0,0.947214,-0.968263,-0.800623,1.701089,0.411624,0.654020
377,1.0,0.651593,-1.189192,-0.832782,1.358023,-0.834960,0.291322
89,1.0,1.242834,0.498702,2.240260,-1.386513,-1.656909,-2.008608
233,1.0,0.651593,1.974505,-0.609134,1.701089,0.866702,0.406547
...,...,...,...,...,...,...,...
323,1.0,0.947214,0.993582,-0.716529,0.671889,0.566179,0.736512
192,1.0,0.060352,2.336827,-0.826532,1.014956,-0.121511,0.496240
117,1.0,-0.530890,-0.331989,2.436867,-1.386513,-2.357869,-1.916951
47,1.0,1.538455,1.638693,-0.366832,-0.357312,0.513100,0.264479


In [56]:
from statsmodels.stats.outliers_influence import variance_inflation_factor
# Check for multicollinearity among the continuous variables using the Variance Inflation Factor (VIF).
# The results show no problematic multicollinearity, as the VIF of every continuous variable
# is below 5.
X = add_constant(x_train.iloc[:,0:6]) 
# Calculate VIF for each predictor variable
vif_data = pd.DataFrame()
vif_data["feature"] = X.columns
vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

print(vif_data)

     feature       VIF
0      const  1.000000
1     t.date  1.021891
2      h.age  1.013221
3   dist.mrt  4.281389
4  no.stores  1.660112
5        lat  1.566731
6       long  2.840975


In [58]:
# Train a Ridge regression model on the explanatory variables (x_train) and the
# outcome variable (y_train), tuning its hyperparameters with a grid search.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

# Define Ridge regression model
ridge = Ridge()

# Define hyperparameters to tune
param_grid = {
    'alpha': [0.01, 0.1, 1.0, 10.0],  # Regularization strength (L2 penalty)
    'solver': ['auto', 'svd', 'cholesky', 'lsqr', 'sparse_cg', 'sag', 'saga']  # Solver options
}
# Perform cross-validation grid search
grid_search = GridSearchCV(estimator=ridge, param_grid=param_grid, cv=5) # cv=5 for 5-fold cross-validation
grid_search.fit(x_train, y_train)

# Get the best hyperparameters
best_params = grid_search.best_params_
print("Best hyperparameters:", best_params)

# Get the best model
best_model = grid_search.best_estimator_

# Evaluate the best model on the train set using RMSE
y_train_pred = best_model.predict(x_train)
rmse_train = np.sqrt(mean_squared_error(y_train, y_train_pred))  # RMSE on train set (np.sqrt works across scikit-learn versions)
print("Root Mean Squared Error on train set:", np.round(rmse_train,2))

Best hyperparameters: {'alpha': 10.0, 'solver': 'sag'}
Root Mean Squared Error on train set: 8.94


In [59]:
# Check model performance on the train set using the mean absolute percentage error (MAPE).
# The per-record errors are averaged without weights, so the value is the MAPE rather than a weighted MAPE.

# Actual values (y_true) and predicted values (y_pred) for the train set.
y_true = np.asarray(y_train)
y_pred = np.asarray(y_train_pred)

# Compute the absolute percentage errors.
absolute_percentage_errors = np.abs((y_true - y_pred) / y_true)

# Average the absolute percentage errors and express the result as a percentage.
mape = (np.sum(absolute_percentage_errors) / len(y_train)) * 100

print("Mean Absolute Percentage Error (MAPE) on train set:", np.round(mape,2))
# Try other models because the error is still high.

Mean Absolute Percentage Error (MAPE) on train set: 18.11
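Every error reported so far is computed on the train set, so it does not show how a model behaves on unseen records. The sketch below illustrates the held-out comparison on synthetic data with the same 414-record size and 70-30 split; the data, the fixed `alpha`, and the variable names are assumptions, and the printed values say nothing about the real dataset.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real features and prices (illustration only).
X_demo, y_demo = make_regression(n_samples=414, n_features=6, noise=15.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.30, random_state=0)

model = Ridge(alpha=10.0).fit(X_tr, y_tr)

# RMSE on the records used for fitting versus the records held out.
rmse_tr = np.sqrt(mean_squared_error(y_tr, model.predict(X_tr)))
rmse_te = np.sqrt(mean_squared_error(y_te, model.predict(X_te)))

print("Train RMSE:", np.round(rmse_tr, 2))
print("Test RMSE :", np.round(rmse_te, 2))
```

A test RMSE well above the train RMSE would point to overfitting, while similar values suggest the model generalizes about as well as it fits.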


In [60]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Lasso
import warnings

# Ignore all warnings.
warnings.filterwarnings("ignore")

# Define the Lasso regression model
lasso = Lasso()

# Define hyperparameters to tune
param_grid = {
    'alpha': [0.01, 0.1, 1.0, 10.0]  # Regularization strength
}

# Perform cross-validation grid search
grid_search = GridSearchCV(estimator=lasso, param_grid=param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(x_train, y_train)

# Get the best hyperparameters
best_params = grid_search.best_params_
print("Best hyperparameters:", best_params)

# Get the best model
best_model = grid_search.best_estimator_

# Evaluate the best model on the train set using RMSE
y_train_pred = best_model.predict(x_train)
rmse_train = np.sqrt(mean_squared_error(y_train, y_train_pred))  # RMSE on train set (np.sqrt works across scikit-learn versions)
print("Root Mean Squared Error on train set:", np.round(rmse_train,2))

Best hyperparameters: {'alpha': 0.1}
Root Mean Squared Error on train set: 8.94


In [62]:
# Check model performance on the train set using the mean absolute percentage error (MAPE).
# The per-record errors are averaged without weights, so the value is the MAPE rather than a weighted MAPE.

# Actual values (y_true) and predicted values (y_pred) for the train set.
y_true = np.asarray(y_train)
y_pred = np.asarray(y_train_pred)

# Compute the absolute percentage errors.
absolute_percentage_errors = np.abs((y_true - y_pred) / y_true)

# Average the absolute percentage errors and express the result as a percentage.
mape = (np.sum(absolute_percentage_errors) / len(y_train)) * 100

print("Mean Absolute Percentage Error (MAPE) on train set:", np.round(mape,2))
# Try other models because the error is still high.

Mean Absolute Percentage Error (MAPE) on train set: 18.13


In [63]:
# Performing Elastic Net Regression
# Import necessary libraries
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV

# The train set (x_train, y_train) prepared above is reused here.

# Define the hyperparameter grid for Elastic Net
parametersGrid = {
    "max_iter": [1, 5, 10],
    "alpha": [0.0001, 0.001, 0.01, 0.1, 1, 10, 100],
    "l1_ratio": np.arange(0.0, 1.0, 0.1)
}

# Initialize the Elastic Net model
eNet = ElasticNet()

# Perform grid search to find the best hyperparameters
grid_search  = GridSearchCV(eNet, parametersGrid, scoring='neg_mean_squared_error', cv=5)
grid_search.fit(x_train, y_train)

# Get the best hyperparameters
best_params = grid_search.best_params_
print("Best hyperparameters:", best_params)

# Get the best model
best_model = grid_search.best_estimator_

# Evaluate the best model on the train set using RMSE
y_train_pred = best_model.predict(x_train)
rmse_train = np.sqrt(mean_squared_error(y_train, y_train_pred))  # RMSE on train set (np.sqrt works across scikit-learn versions)
print("Root Mean Squared Error on train set:", np.round(rmse_train,2))

Best hyperparameters: {'alpha': 0.1, 'l1_ratio': 0.2, 'max_iter': 5}
Root Mean Squared Error on train set: 8.95


In [64]:
# Check model performance on the train set using the mean absolute percentage error (MAPE).
# The per-record errors are averaged without weights, so the value is the MAPE rather than a weighted MAPE.

# Actual values (y_true) and predicted values (y_pred) for the train set.
y_true = np.asarray(y_train)
y_pred = np.asarray(y_train_pred)

# Compute the absolute percentage errors.
absolute_percentage_errors = np.abs((y_true - y_pred) / y_true)

# Average the absolute percentage errors and express the result as a percentage.
mape = (np.sum(absolute_percentage_errors) / len(y_train)) * 100

print("Mean Absolute Percentage Error (MAPE) on train set:", np.round(mape,2))
# Try a different set of features because the error is still high.

Mean Absolute Percentage Error (MAPE) on train set: 18.12


# Another go... but this time exclude latitude and longitude.

In [70]:
# Create a dataset with the categorical explanatory variable (t.date) one-hot encoded.
train_1 = pd.get_dummies(train, columns = ['t.date'], dtype=int)
train_1.head()

Unnamed: 0,h.age,dist.mrt,no.stores,lat,long,price,t.date_Apr-2013,t.date_Aug-2012,t.date_Dec-2012,t.date_Feb-2013,t.date_Jan-2013,t.date_Jul-2013,t.date_Jun-2013,t.date_Mar-2013,t.date_May-2013,t.date_Nov-2012,t.date_Oct-2012,t.date_Sept-2012
294,26.4,335.5273,6,24.9796,121.5414,38.1,0,0,0,0,0,0,1,0,0,0,0,0
96,6.4,90.45606,9,24.97433,121.5431,59.5,0,0,0,0,0,0,0,0,1,0,0,0
377,3.9,49.66105,8,24.95836,121.53756,56.8,1,0,0,0,0,0,0,0,0,0,0,0
89,23.0,3947.945,0,24.94783,121.50243,25.3,0,0,0,0,0,0,1,0,0,0,0,0
233,39.7,333.3679,9,24.98016,121.53932,32.4,1,0,0,0,0,0,0,0,0,0,0,0


In [71]:
# Separating the explanatory variables from the outcome variable.
x_train = train_1.drop(['price'], axis = 1)
y_train = train_1['price']
x_train.head()

Unnamed: 0,h.age,dist.mrt,no.stores,lat,long,t.date_Apr-2013,t.date_Aug-2012,t.date_Dec-2012,t.date_Feb-2013,t.date_Jan-2013,t.date_Jul-2013,t.date_Jun-2013,t.date_Mar-2013,t.date_May-2013,t.date_Nov-2012,t.date_Oct-2012,t.date_Sept-2012
294,26.4,335.5273,6,24.9796,121.5414,0,0,0,0,0,0,1,0,0,0,0,0
96,6.4,90.45606,9,24.97433,121.5431,0,0,0,0,0,0,0,0,1,0,0,0
377,3.9,49.66105,8,24.95836,121.53756,1,0,0,0,0,0,0,0,0,0,0,0
89,23.0,3947.945,0,24.94783,121.50243,0,0,0,0,0,0,1,0,0,0,0,0
233,39.7,333.3679,9,24.98016,121.53932,1,0,0,0,0,0,0,0,0,0,0,0


In [72]:
# Standardize all the continuous variables.
from sklearn.preprocessing import StandardScaler

# Names of the continuous explanatory variables in x_train
continuous_vars = ['h.age', 'dist.mrt', 'no.stores','lat','long']

# Initialize StandardScaler
scaler = StandardScaler()

# Fit scaler to the continuous variables and transform them
x_train[continuous_vars] = scaler.fit_transform(x_train[continuous_vars])
x_train.head(5)

Unnamed: 0,h.age,dist.mrt,no.stores,lat,long,t.date_Apr-2013,t.date_Aug-2012,t.date_Dec-2012,t.date_Feb-2013,t.date_Jan-2013,t.date_Jul-2013,t.date_Jun-2013,t.date_Mar-2013,t.date_May-2013,t.date_Nov-2012,t.date_Oct-2012,t.date_Sept-2012
294,0.799165,-0.607432,0.671889,0.822989,0.542723,0,0,0,0,0,0,1,0,0,0,0,0
96,-0.968263,-0.800623,1.701089,0.411624,0.65402,0,0,0,0,0,0,0,0,1,0,0,0
377,-1.189192,-0.832782,1.358023,-0.83496,0.291322,1,0,0,0,0,0,0,0,0,0,0,0
89,0.498702,2.24026,-1.386513,-1.656909,-2.008608,0,0,0,0,0,0,1,0,0,0,0,0
233,1.974505,-0.609134,1.701089,0.866702,0.406547,1,0,0,0,0,0,0,0,0,0,0,0
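
Standardization replaces each value with its distance from the column mean, measured in units of the column standard deviation. A minimal NumPy sketch of the same arithmetic, using the five `h.age` values from the train preview as sample input (`StandardScaler` uses the population standard deviation, which corresponds to `ddof=0`):

```python
import numpy as np

# First five h.age values from the train set preview (sample input only;
# the notebook's scaled values use the mean and spread of the full train set)
h_age = np.array([26.4, 6.4, 3.9, 23.0, 39.7])

mean = h_age.mean()
std = h_age.std(ddof=0)  # population standard deviation, as StandardScaler uses

z_scores = (h_age - mean) / std
print(np.round(z_scores, 3))
```

After the transformation the column has mean 0 and standard deviation 1, which puts variables measured in years, metres, and counts on a comparable scale.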


In [73]:
from statsmodels.tools.tools import add_constant
X = add_constant(x_train.iloc[:,0:3]) 
X

Unnamed: 0,const,h.age,dist.mrt,no.stores
294,1.0,0.799165,-0.607432,0.671889
96,1.0,-0.968263,-0.800623,1.701089
377,1.0,-1.189192,-0.832782,1.358023
89,1.0,0.498702,2.240260,-1.386513
233,1.0,1.974505,-0.609134,1.701089
...,...,...,...,...
323,1.0,0.993582,-0.716529,0.671889
192,1.0,2.336827,-0.826532,1.014956
117,1.0,-0.331989,2.436867,-1.386513
47,1.0,1.638693,-0.366832,-0.357312


In [75]:
from statsmodels.stats.outliers_influence import variance_inflation_factor
# Checking for multicollinearity among the continuous variables h.age, dist.mrt, and no.stores
# using the Variance Inflation Factor (VIF). All three VIF values are below 5, so there is
# no sign of problematic multicollinearity among them.
X = add_constant(x_train.iloc[:,0:3]) 
# Calculate VIF for each predictor variable
vif_data = pd.DataFrame()
vif_data["feature"] = X.columns
vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

print(vif_data)

     feature       VIF
0      const  1.000000
1      h.age  1.002420
2   dist.mrt  1.630521
3  no.stores  1.627700
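
The VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on the remaining ones. The sketch below reproduces that calculation with NumPy on synthetic data; the function name `vif` and the generated columns are assumptions for illustration, not the notebook's data.

```python
import numpy as np

def vif(X, j):
    """VIF of column j: regress it on the other columns plus an intercept."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    design = np.column_stack([np.ones(len(y)), others])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    r_squared = 1.0 - residuals.var() / y.var()
    return 1.0 / (1.0 - r_squared)

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)               # independent of a
c = a + 0.1 * rng.normal(size=500)     # nearly a copy of a

X = np.column_stack([a, b, c])
for name, j in zip(["a", "b", "c"], range(3)):
    print(name, np.round(vif(X, j), 2))
```

The independent column stays close to 1, while the two nearly identical columns receive large values, which is the pattern a VIF threshold such as 5 is meant to flag.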


In [83]:
# Drop latitude and longitude before developing this version of the machine learning model.
x_train = x_train.drop(['lat', 'long'], axis = 1)
x_train.head()

Unnamed: 0,h.age,dist.mrt,no.stores,t.date_Apr-2013,t.date_Aug-2012,t.date_Dec-2012,t.date_Feb-2013,t.date_Jan-2013,t.date_Jul-2013,t.date_Jun-2013,t.date_Mar-2013,t.date_May-2013,t.date_Nov-2012,t.date_Oct-2012,t.date_Sept-2012
294,0.799165,-0.607432,0.671889,0,0,0,0,0,0,1,0,0,0,0,0
96,-0.968263,-0.800623,1.701089,0,0,0,0,0,0,0,0,1,0,0,0
377,-1.189192,-0.832782,1.358023,1,0,0,0,0,0,0,0,0,0,0,0
89,0.498702,2.24026,-1.386513,0,0,0,0,0,0,1,0,0,0,0,0
233,1.974505,-0.609134,1.701089,1,0,0,0,0,0,0,0,0,0,0,0


In [84]:
# Training a machine learning model for a regression problem using the x_train dataset and the
# outcome variable y_train.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge # You can replace Ridge with any other regression model you want to tune
from sklearn.metrics import mean_squared_error
# Features are in x_train and the target variable is in y_train

# Define Ridge regression model
ridge = Ridge()

# Define hyperparameters to tune
param_grid = {
    'alpha': [0.01, 0.1, 1.0, 10.0],  # Regularization strength (L2 penalty)
    'solver': ['auto', 'svd', 'cholesky', 'lsqr', 'sparse_cg', 'sag', 'saga']  # Solver options
}
# Perform cross-validation grid search (with no scoring argument, candidates are ranked by R^2)
grid_search = GridSearchCV(estimator=ridge, param_grid=param_grid, cv=5) # cv=5 for 5-fold cross-validation
grid_search.fit(x_train, y_train)

# Get the best hyperparameters
best_params = grid_search.best_params_
print("Best hyperparameters:", best_params)

# Get the best model
best_model = grid_search.best_estimator_

# Evaluate the best model on the train set using RMSE
y_train_pred = best_model.predict(x_train)
rmse_train = np.sqrt(mean_squared_error(y_train, y_train_pred))  # RMSE on train set (square root of the MSE)
print("Root Mean Squared Error on train set:", np.round(rmse_train,2))

Best hyperparameters: {'alpha': 10.0, 'solver': 'saga'}
Root Mean Squared Error on train set: 9.14
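
The RMSE above is computed on the same rows the model was fitted to. As a hedged sketch of an alternative, the grid search can be asked to rank candidates by RMSE and report the cross-validated value, which is measured on held-out folds. The data here is synthetic (`make_regression`), and `X_demo`, `y_demo`, and `search` are illustrative names.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for x_train / y_train
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(
    estimator=Ridge(),
    param_grid=param_grid,
    cv=5,
    scoring="neg_root_mean_squared_error",  # higher is better, so RMSE is negated
)
search.fit(X_demo, y_demo)

cv_rmse = -search.best_score_  # mean RMSE over the 5 held-out folds
train_rmse = np.sqrt(np.mean((y_demo - search.predict(X_demo)) ** 2))

print("Best alpha:", search.best_params_["alpha"])
print("Cross-validated RMSE:", np.round(cv_rmse, 2))
print("Train RMSE:", np.round(train_rmse, 2))
```

Comparing the two numbers gives an early hint of how much the train-set figure may flatter the model before the test set is touched.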


In [85]:
# Checking model performance on the train set using the mean absolute percentage error (MAPE).
# Note: the printed label says "WMAPE", but no weights are applied, so the value is a plain MAPE.

# Actual values (y_true) and predicted values (y_pred) for the train set
y_true = np.asarray(y_train)
y_pred = np.asarray(y_train_pred)

# Compute the absolute percentage errors
absolute_percentage_errors = np.abs((y_true - y_pred) / y_true)

# Average the absolute percentage errors and express the result as a percentage
mape = (np.sum(absolute_percentage_errors) / len(y_train)) * 100

print("Weighted Mean Absolute Percentage Error (WMAPE):", np.round(mape,2))
# Performance is still poor, so other models are tried next.

Weighted Mean Absolute Percentage Error (WMAPE): 19.02


In [86]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Lasso
import warnings

# Ignore all warnings (this also hides convergence warnings from the solver)
warnings.filterwarnings("ignore")

# Features are in x_train and the target variable is in y_train

# Define the Lasso regression model
lasso = Lasso()

# Define hyperparameters to tune
param_grid = {
    'alpha': [0.01, 0.1, 1.0, 10.0]  # Regularization strength
}

# Perform cross-validation grid search
grid_search = GridSearchCV(estimator=lasso, param_grid=param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(x_train, y_train)

# Get the best hyperparameters
best_params = grid_search.best_params_
print("Best hyperparameters:", best_params)

# Get the best model
best_model = grid_search.best_estimator_

# Evaluate the best model on the train set using RMSE
y_train_pred = best_model.predict(x_train)
rmse_train = np.sqrt(mean_squared_error(y_train, y_train_pred))  # RMSE on train set (square root of the MSE)
print("Root Mean Squared Error on train set:", np.round(rmse_train,2))

Best hyperparameters: {'alpha': 0.1}
Root Mean Squared Error on train set: 9.17


In [87]:
# Checking model performance on the train set using the mean absolute percentage error (MAPE).
# Note: the printed label says "WMAPE", but no weights are applied, so the value is a plain MAPE.

# Actual values (y_true) and predicted values (y_pred) for the train set
y_true = np.asarray(y_train)
y_pred = np.asarray(y_train_pred)

# Compute the absolute percentage errors
absolute_percentage_errors = np.abs((y_true - y_pred) / y_true)

# Average the absolute percentage errors and express the result as a percentage
mape = (np.sum(absolute_percentage_errors) / len(y_train)) * 100

print("Weighted Mean Absolute Percentage Error (WMAPE):", np.round(mape,2))
# Performance is still poor, so other models are tried next.

Weighted Mean Absolute Percentage Error (WMAPE): 19.23


In [88]:
# Performing Elastic Net Regression
# Import necessary libraries
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV

# x_train and y_train from the earlier cells are reused here.

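# Define the hyperparameter grid for Elastic Net.
# Note: max_iter values this small can stop the solver before it converges, and
# np.arange(0.0, 1.0, 0.1) ends at 0.9, so a pure Lasso penalty (l1_ratio = 1.0) is not searched.
parametersGrid = {
    "max_iter": [1, 5, 10],
    "alpha": [0.0001, 0.001, 0.01, 0.1, 1, 10, 100],
    "l1_ratio": np.arange(0.0, 1.0, 0.1)
}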

# Initialize the Elastic Net model
eNet = ElasticNet()

# Perform grid search to find the best hyperparameters
grid_search  = GridSearchCV(eNet, parametersGrid, scoring='neg_mean_squared_error', cv=5)
grid_search.fit(x_train, y_train)

# Get the best hyperparameters
best_params = grid_search.best_params_
print("Best hyperparameters:", best_params)

# Get the best model
best_model = grid_search.best_estimator_

# Evaluate the best model on the train set using RMSE
y_train_pred = best_model.predict(x_train)
rmse_train = np.sqrt(mean_squared_error(y_train, y_train_pred))  # RMSE on train set (square root of the MSE)
print("Root Mean Squared Error on train set:", np.round(rmse_train,2))

Best hyperparameters: {'alpha': 0.1, 'l1_ratio': 0.0, 'max_iter': 5}
Root Mean Squared Error on train set: 9.22
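
For reference, the objective that scikit-learn's `ElasticNet` minimizes can be written as follows, with $n$ samples, coefficient vector $w$, $\alpha$ standing for `alpha`, and $\rho$ standing for `l1_ratio` (the intercept is left out for brevity):

```latex
\min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2
  \;+\; \alpha \rho \, \lVert w \rVert_1
  \;+\; \frac{\alpha (1 - \rho)}{2} \, \lVert w \rVert_2^2
```

With the selected `l1_ratio` of 0.0 the L1 term vanishes and only the L2 (ridge-style) penalty remains; at `l1_ratio` = 1.0 the penalty would reduce to the Lasso's.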


In [89]:
# Checking model performance on the train set using the mean absolute percentage error (MAPE).
# Note: the printed label says "WMAPE", but no weights are applied, so the value is a plain MAPE.

# Actual values (y_true) and predicted values (y_pred) for the train set
y_true = np.asarray(y_train)
y_pred = np.asarray(y_train_pred)

# Compute the absolute percentage errors
absolute_percentage_errors = np.abs((y_true - y_pred) / y_true)

# Average the absolute percentage errors and express the result as a percentage
mape = (np.sum(absolute_percentage_errors) / len(y_train)) * 100

print("Weighted Mean Absolute Percentage Error (WMAPE):", np.round(mape,2))
# Performance is still poor, so other models are tried next.

Weighted Mean Absolute Percentage Error (WMAPE): 19.18


### From the exploration, treating the date as a categorical variable and including latitude and longitude appears to lead to better regression model performance: the reported train error is 18.12% with latitude and longitude and 19.02% to 19.23% without them.

# Try other regression models such as Random Forest and Extreme Gradient Boosting (XGBoost).

In [13]:
x_train.head()

Unnamed: 0,h.age,dist.mrt,no.stores,lat,long,t.date_Apr-2013,t.date_Aug-2012,t.date_Dec-2012,t.date_Feb-2013,t.date_Jan-2013,t.date_Jul-2013,t.date_Jun-2013,t.date_Mar-2013,t.date_May-2013,t.date_Nov-2012,t.date_Oct-2012,t.date_Sept-2012
294,0.799165,-0.607432,0.671889,0.822989,0.542723,0,0,0,0,0,0,1,0,0,0,0,0
96,-0.968263,-0.800623,1.701089,0.411624,0.65402,0,0,0,0,0,0,0,0,1,0,0,0
377,-1.189192,-0.832782,1.358023,-0.83496,0.291322,1,0,0,0,0,0,0,0,0,0,0,0
89,0.498702,2.24026,-1.386513,-1.656909,-2.008608,0,0,0,0,0,0,1,0,0,0,0,0
233,1.974505,-0.609134,1.701089,0.866702,0.406547,1,0,0,0,0,0,0,0,0,0,0,0


In [14]:
import xgboost as xgb
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error

# Define the XGBoost regressor
xgb_regressor = xgb.XGBRegressor()

# Define hyperparameters to tune
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 4, 5],
    'learning_rate': [0.01, 0.1, 0.3],
    'gamma': [0, 0.1, 0.2],
    'subsample': [0.8, 0.9, 1.0],
    'colsample_bytree': [0.8, 0.9, 1.0],
    'reg_alpha': [0, 0.1, 0.5],
    'reg_lambda': [0, 0.1, 0.5]
}

# Perform cross-validation grid search
grid_search = GridSearchCV(estimator=xgb_regressor, param_grid=param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(x_train, y_train)

# Get the best hyperparameters
best_params = grid_search.best_params_
print("Best hyperparameters:", best_params)

# Get the best model
best_model = grid_search.best_estimator_

# Evaluate the best model on the train set using RMSE
y_train_pred = best_model.predict(x_train)
rmse_train = np.sqrt(mean_squared_error(y_train, y_train_pred))  # RMSE on train set (square root of the MSE)
print("Root Mean Squared Error on train set:", np.round(rmse_train,2))

Best hyperparameters: {'colsample_bytree': 1.0, 'gamma': 0, 'learning_rate': 0.01, 'max_depth': 5, 'n_estimators': 300, 'reg_alpha': 0, 'reg_lambda': 0.1, 'subsample': 0.8}
Root Mean Squared Error on train set: 3.21
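
This grid is exhaustive: eight hyperparameters with three candidate values each, and every combination is refitted once per fold. A quick count of the work involved, as plain arithmetic on the grid defined above:

```python
import math

param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 4, 5],
    'learning_rate': [0.01, 0.1, 0.3],
    'gamma': [0, 0.1, 0.2],
    'subsample': [0.8, 0.9, 1.0],
    'colsample_bytree': [0.8, 0.9, 1.0],
    'reg_alpha': [0, 0.1, 0.5],
    'reg_lambda': [0, 0.1, 0.5]
}
cv_folds = 5

n_candidates = math.prod(len(values) for values in param_grid.values())
n_fits = n_candidates * cv_folds  # excludes the final refit on the full train set

print("Candidate combinations:", n_candidates)
print("Model fits during the search:", n_fits)
```

If search time becomes a concern, scikit-learn's `RandomizedSearchCV` can sample a fixed number of combinations from the same lists instead of trying all of them.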


In [67]:
# Checking model performance on the train set using the mean absolute percentage error (MAPE).
# Note: the printed label says "WMAPE", but no weights are applied, so the value is a plain MAPE.

# Actual values (y_true) and predicted values (y_pred) for the train set
y_true = np.asarray(y_train)
y_pred = np.asarray(y_train_pred)

# Compute the absolute percentage errors
absolute_percentage_errors = np.abs((y_true - y_pred) / y_true)

# Average the absolute percentage errors and express the result as a percentage
mape = (np.sum(absolute_percentage_errors) / len(y_train)) * 100

print("Weighted Mean Absolute Percentage Error (WMAPE):", np.round(mape,2))

Weighted Mean Absolute Percentage Error (WMAPE): 7.22


In [None]:
import pickle
# Save the model using pickle; the with-statement closes the file once the model is written.
with open('REmodel.pkl', 'wb') as model_file:
    pickle.dump(best_model, model_file)
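
A small round-trip check can confirm that a saved object loads back intact. The sketch below uses a temporary directory and a plain dictionary as a stand-in for `best_model`; both are assumptions for illustration. Pickle files should only be loaded from trusted sources, since loading one can execute arbitrary code.

```python
import os
import pickle
import tempfile

# Stand-in object; in the notebook this would be best_model
model_stand_in = {"name": "demo", "coefficients": [0.5, -1.2, 3.0]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.pkl")

    # Write the object; the with-statement closes the file afterwards
    with open(path, "wb") as f:
        pickle.dump(model_stand_in, f)

    # Read it back
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == model_stand_in)
```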

## Use the best model to predict real estate price on the test set.

In [50]:
# Convert the transaction date so that the year and month are represented as a categorical variable
# instead of a continuous variable.
replace_values = {2013.0833333:'Jan-2013',
                    2013.1666667: 'Feb-2013', 2013.25: 'Mar-2013',
                    2013.3333333: 'Apr-2013', 2013.4166667: 'May-2013',
                    2013.5: 'Jun-2013', 2013.5833333: 'Jul-2013',
                    2012.6666667: 'Aug-2012', 2012.75: 'Sept-2012',
                    2012.8333333: 'Oct-2012', 2012.9166667: 'Nov-2012',
                    2013.0: 'Dec-2012'}   
test = test.replace({"t.date": replace_values}) 
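
The dictionary above matches on floating-point keys, so it depends on the stored values equalling the typed keys exactly. A hedged alternative is to derive the label arithmetically. The sketch assumes the convention implied by the mapping above (a fraction of k/12 means month k, and a fraction of 0 means December of the previous year); `to_month_label` is a hypothetical helper name.

```python
import math

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sept", "Oct", "Nov", "Dec"]

def to_month_label(t_date):
    """Convert a fractional year such as 2013.25 to a label such as 'Mar-2013'."""
    fraction, year = math.modf(t_date)   # split the decimal part from the whole year
    month = round(fraction * 12)         # number of twelfths, 0 to 12
    year = int(year)
    if month == 0:                       # x.000 is treated as December of the prior year
        return f"Dec-{year - 1}"
    return f"{MONTHS[month - 1]}-{year}"

for value in [2012.6666667, 2012.75, 2013.0, 2013.0833333, 2013.5]:
    print(value, "->", to_month_label(value))
```

With pandas, the helper could be applied column-wise, for example `test['t.date'].map(to_month_label)`.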

In [35]:
test.head()

Unnamed: 0,t.date,h.age,dist.mrt,no.stores,lat,long,price
356,Oct-2012,10.3,211.4473,1,24.97417,121.52999,45.3
170,Apr-2013,24.0,4527.687,0,24.94741,121.49628,14.4
224,Apr-2013,34.5,324.9419,6,24.97814,121.5417,46.0
331,Apr-2013,25.6,4519.69,0,24.94826,121.49587,15.6
306,Jun-2013,14.4,169.9803,1,24.97369,121.52979,50.2


In [51]:
# One-hot encode the test data.
# Creating a dataset with the categorical explanatory variable (t.date) one-hot encoded.
test_1 = pd.get_dummies(test, columns = ['t.date'], dtype=int)
test_1.head()

Unnamed: 0,h.age,dist.mrt,no.stores,lat,long,price,t.date_Apr-2013,t.date_Aug-2012,t.date_Dec-2012,t.date_Feb-2013,t.date_Jan-2013,t.date_Jul-2013,t.date_Jun-2013,t.date_Mar-2013,t.date_May-2013,t.date_Nov-2012,t.date_Oct-2012,t.date_Sept-2012
356,10.3,211.4473,1,24.97417,121.52999,45.3,0,0,0,0,0,0,0,0,0,0,1,0
170,24.0,4527.687,0,24.94741,121.49628,14.4,1,0,0,0,0,0,0,0,0,0,0,0
224,34.5,324.9419,6,24.97814,121.5417,46.0,1,0,0,0,0,0,0,0,0,0,0,0
331,25.6,4519.69,0,24.94826,121.49587,15.6,1,0,0,0,0,0,0,0,0,0,0,0
306,14.4,169.9803,1,24.97369,121.52979,50.2,0,0,0,0,0,0,1,0,0,0,0,0
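
Encoding train and test separately yields the same columns only when both contain every month. If a month were missing from the test set, the model would receive a different feature layout. A minimal sketch of aligning the test columns to the train columns with `reindex`; the tiny frames are made up for illustration.

```python
import pandas as pd

train_demo = pd.DataFrame({"h.age": [26.4, 6.4, 3.9],
                           "t.date": ["Jun-2013", "May-2013", "Apr-2013"]})
test_demo = pd.DataFrame({"h.age": [10.3, 24.0],
                          "t.date": ["Apr-2013", "Apr-2013"]})  # no May or June rows

train_enc = pd.get_dummies(train_demo, columns=["t.date"], dtype=int)
test_enc = pd.get_dummies(test_demo, columns=["t.date"], dtype=int)

# Add the dummy columns the test set lacks (filled with 0) and match the train order
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(test_enc.columns))
print(list(test_aligned.columns))
```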


In [52]:
# Separating the explanatory variables from the outcome variable.
x_test = test_1.drop(['price'], axis = 1)
y_test = test_1['price']
x_test.head()

Unnamed: 0,h.age,dist.mrt,no.stores,lat,long,t.date_Apr-2013,t.date_Aug-2012,t.date_Dec-2012,t.date_Feb-2013,t.date_Jan-2013,t.date_Jul-2013,t.date_Jun-2013,t.date_Mar-2013,t.date_May-2013,t.date_Nov-2012,t.date_Oct-2012,t.date_Sept-2012
356,10.3,211.4473,1,24.97417,121.52999,0,0,0,0,0,0,0,0,0,0,1,0
170,24.0,4527.687,0,24.94741,121.49628,1,0,0,0,0,0,0,0,0,0,0,0
224,34.5,324.9419,6,24.97814,121.5417,1,0,0,0,0,0,0,0,0,0,0,0
331,25.6,4519.69,0,24.94826,121.49587,1,0,0,0,0,0,0,0,0,0,0,0
306,14.4,169.9803,1,24.97369,121.52979,0,0,0,0,0,0,1,0,0,0,0,0


In [31]:
y_test

356    45.3
170    14.4
224    46.0
331    15.6
306    50.2
       ... 
353    31.3
81     36.8
107    26.6
362    40.0
410    50.0
Name: price, Length: 125, dtype: float64

In [53]:
# Standardize all the continuous variables.
from sklearn.preprocessing import StandardScaler

# Names of the continuous explanatory variables in x_test
continuous_vars = ['h.age', 'dist.mrt', 'no.stores','lat','long']

# Initialize StandardScaler
scaler = StandardScaler()

# Fit scaler to the continuous variables and transform them.
# Caution: this refits the scaler on the test set, so the test features are scaled with
# test-set means and standard deviations rather than the ones learned from x_train.
# The outputs below come from this refit.
x_test[continuous_vars] = scaler.fit_transform(x_test[continuous_vars])
x_test.head(5)

Unnamed: 0,h.age,dist.mrt,no.stores,lat,long,t.date_Apr-2013,t.date_Aug-2012,t.date_Dec-2012,t.date_Feb-2013,t.date_Jan-2013,t.date_Jul-2013,t.date_Jun-2013,t.date_Mar-2013,t.date_May-2013,t.date_Nov-2012,t.date_Oct-2012,t.date_Sept-2012
356,-0.717318,-0.661945,-1.07192,0.457247,-0.255934,0,0,0,0,0,0,0,0,0,0,1,0
170,0.476005,2.81757,-1.405228,-1.895163,-2.439542,1,0,0,0,0,0,0,0,0,0,0,0
224,1.390596,-0.570452,0.594622,0.80624,0.502597,1,0,0,0,0,0,0,0,0,0,0,0
331,0.615371,2.811123,-1.405228,-1.820441,-2.4661,1,0,0,0,0,0,0,0,0,0,0,0
306,-0.360192,-0.695373,-1.07192,0.415051,-0.268889,0,0,0,0,0,0,1,0,0,0,0,0
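
The cell above fits a new scaler on the test rows. A common alternative is to fit the scaler once on the train set and reuse its means and standard deviations for the test set, so that both are on the same scale. A minimal sketch with made-up numbers (`train_demo`, `test_demo`, and `demo_scaler` are illustrative names):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

continuous_vars = ["h.age", "dist.mrt"]
train_demo = pd.DataFrame({"h.age": [10.0, 20.0, 30.0], "dist.mrt": [100.0, 300.0, 500.0]})
test_demo = pd.DataFrame({"h.age": [20.0, 40.0], "dist.mrt": [300.0, 900.0]})

demo_scaler = StandardScaler()
train_scaled = demo_scaler.fit_transform(train_demo[continuous_vars])  # learn mean/std from train
test_scaled = demo_scaler.transform(test_demo[continuous_vars])        # reuse them on test

print(np.round(test_scaled, 3))
```

A test row equal to the train mean maps to 0 here, whereas refitting on the test rows would centre it on the test mean instead.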


In [65]:
# Load the pickled model.
# Estimate house price using the xgboost model.
with open('REmodel.pkl', 'rb') as model_file:
    pickled_model = pickle.load(model_file)
pickled_model.predict(x_test)

array([79.73015 , 16.55483 , 38.929295, 16.411913, 43.69424 , 38.145706,
       41.962368, 37.823513, 53.479973, 42.84342 , 45.780434, 30.305021,
       80.6664  , 42.50713 , 58.60952 , 46.810673, 37.097977, 46.891563,
       40.08247 , 42.20813 , 52.903477, 23.921297, 35.527668, 44.61505 ,
       51.55138 , 41.683056, 41.66734 , 25.792072, 49.50788 , 24.516224,
       41.59101 , 27.716534, 78.886154, 45.010315, 41.549374, 23.737482,
       41.91754 , 28.928562, 48.73149 , 15.240374, 48.365765, 38.131805,
       26.400988, 44.79006 , 20.07199 , 42.35769 , 38.742367, 16.91329 ,
       27.506887, 47.101143, 53.76049 , 38.293823, 42.552387, 20.51382 ,
       23.183434, 37.179096, 50.775166, 40.59088 , 44.5257  , 26.245016,
       38.592743, 47.12448 , 39.802456, 53.844906, 41.204666, 27.077793,
       16.701069, 28.602749, 50.120506, 38.145706, 24.657223, 49.247917,
       44.302334, 26.092432, 41.395836, 30.27972 , 28.561125, 17.097267,
       44.295395, 35.343163, 43.71268 , 23.008585, 

In [57]:
# Evaluate the best model on the test set using RMSE.
y_test_pred = pickled_model.predict(x_test)
rmse_test = np.sqrt(mean_squared_error(y_test, y_test_pred))  # RMSE on test set (square root of the MSE)
print("Root Mean Squared Error on test set:", np.round(rmse_test,2))

Root Mean Squared Error on test set: 9.74


In [58]:
# Checking model performance on the test set using the mean absolute percentage error (MAPE).
# Note: the printed label says "WMAPE", but no weights are applied, so the value is a plain MAPE.

# Actual values (y_true) and predicted values (y_pred) for the test set
y_true = np.asarray(y_test)
y_pred = np.asarray(y_test_pred)

# Compute the absolute percentage errors
absolute_percentage_errors = np.abs((y_true - y_pred) / y_true)

# Average the absolute percentage errors and express the result as a percentage
mape = (np.sum(absolute_percentage_errors) / len(y_test)) * 100

print("Weighted Mean Absolute Percentage Error (WMAPE):", np.round(mape,2))
# The test error is much higher than the train error, so another model is tried next.

Weighted Mean Absolute Percentage Error (WMAPE): 18.56


In [61]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor

# Define Random Forest regressor
rf_regressor = RandomForestRegressor()

# Define hyperparameters grid
param_grid = {
    'n_estimators': [10, 50, 100],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Define GridSearchCV with 5-fold cross-validation
grid_search = GridSearchCV(estimator=rf_regressor, param_grid=param_grid, cv=5)

# Perform GridSearchCV
grid_search.fit(x_train, y_train)

# Print best hyperparameters
print("Best Hyperparameters:", grid_search.best_params_)
# Get the best model
best_model = grid_search.best_estimator_
# Evaluate the best model on the train set using RMSE
y_train_pred = best_model.predict(x_train)
rmse_train = np.sqrt(mean_squared_error(y_train, y_train_pred))  # RMSE on train set (square root of the MSE)
print("Root Mean Squared Error on train set:", np.round(rmse_train,2))

Best Hyperparameters: {'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 50}
Root Mean Squared Error on train set: 3.58
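
`RandomForestRegressor()` draws bootstrap samples at random, so rerunning the search can select different hyperparameters and give slightly different scores. Passing `random_state` makes a run repeatable. A sketch on synthetic data (the seed value 42 and the generated data are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X_demo, y_demo = make_regression(n_samples=100, n_features=4, noise=5.0, random_state=0)

# Two forests built with the same seed produce identical predictions
rf_a = RandomForestRegressor(n_estimators=20, random_state=42).fit(X_demo, y_demo)
rf_b = RandomForestRegressor(n_estimators=20, random_state=42).fit(X_demo, y_demo)

same_predictions = np.allclose(rf_a.predict(X_demo), rf_b.predict(X_demo))
print("Identical predictions:", same_predictions)
```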



In [69]:
# Checking model performance on the train set using the mean absolute percentage error (MAPE).
# Note: the printed label says "WMAPE", but no weights are applied, so the value is a plain MAPE.

# Actual values (y_true) and predicted values (y_pred) for the train set
y_true = np.asarray(y_train)
y_pred = np.asarray(y_train_pred)

# Compute the absolute percentage errors
absolute_percentage_errors = np.abs((y_true - y_pred) / y_true)

# Average the absolute percentage errors and express the result as a percentage
mape = (np.sum(absolute_percentage_errors) / len(y_train)) * 100

print("Weighted Mean Absolute Percentage Error (WMAPE):", np.round(mape,2))

Weighted Mean Absolute Percentage Error (WMAPE): 6.55


In [78]:
# Save a copy of the Random Forest Model.
with open('RFmodel.pkl', 'wb') as model_file:
    pickle.dump(best_model, model_file)

In [81]:
# Load the Random Forest Model.
with open('RFmodel.pkl', 'rb') as model_file:
    RFpickled_model = pickle.load(model_file)
RF_pred = RFpickled_model.predict(x_test)

In [72]:
# Evaluate the best model on the test set using RMSE
y_test_pred = best_model.predict(x_test)
rmse_test = np.sqrt(mean_squared_error(y_test, y_test_pred))  # RMSE on test set (square root of the MSE)
print("Root Mean Squared Error on test set:", np.round(rmse_test,2))

Root Mean Squared Error on test set: 9.15


In [73]:
# Checking model performance on the test set using the mean absolute percentage error (MAPE).
# Note: the printed label says "WMAPE", but no weights are applied, so the value is a plain MAPE.

# Actual values (y_true) and predicted values (y_pred) for the test set
y_true = np.asarray(y_test)
y_pred = np.asarray(y_test_pred)

# Compute the absolute percentage errors
absolute_percentage_errors = np.abs((y_true - y_pred) / y_true)

# Average the absolute percentage errors and express the result as a percentage
mape = (np.sum(absolute_percentage_errors) / len(y_test)) * 100

print("Weighted Mean Absolute Percentage Error (WMAPE):", np.round(mape,2))

Weighted Mean Absolute Percentage Error (WMAPE): 17.67


Performance of the Extreme Gradient Boost Model
Train Dataset
Root Mean Squared Error on train set: 3.21
Weighted Mean Absolute Percentage Error (WMAPE): 7.22

Test Dataset
Root Mean Squared Error on test set: 9.74
Weighted Mean Absolute Percentage Error (WMAPE): 18.56


Performance of the Random Forest Regressor Model
Train Dataset
Root Mean Squared Error on train set: 3.58
Weighted Mean Absolute Percentage Error (WMAPE): 6.55

Test Dataset
Root Mean Squared Error on test set: 9.15
Weighted Mean Absolute Percentage Error (WMAPE): 17.67
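
One way to read the two summaries together is the ratio of test error to train error; values well above 1 indicate that a model fits the train rows much more closely than unseen rows. The arithmetic below uses only the figures reported above.

```python
# Reported metrics: (train RMSE, test RMSE, train MAPE %, test MAPE %)
results = {
    "XGBoost": (3.21, 9.74, 7.22, 18.56),
    "Random Forest": (3.58, 9.15, 6.55, 17.67),
}

ratios = {}
for name, (rmse_train, rmse_test, mape_train, mape_test) in results.items():
    ratios[name] = (round(rmse_test / rmse_train, 2), round(mape_test / mape_train, 2))
    print(f"{name}: RMSE ratio {ratios[name][0]}, MAPE ratio {ratios[name][1]}")
```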

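Looking into different machine learning models for a regression problem predicting house price, we found the Random Forest Regressor to perform well on the train set, with an RMSE of 3.58 and a MAPE of 6.55%. On the test set it gives the lowest errors of the two models evaluated there (RMSE of 9.15 and MAPE of 17.67%, against 9.74 and 18.56% for the Extreme Gradient Boost model), although the gap between the train and test errors suggests that both models overfit the train set.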