Future Month Models

The purpose of this section is to see how well simple regression models can predict the next month's housing value. These models include the current month's price as a predictor in addition to the crime values.

In [59]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge, Lasso
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import r2_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import VotingRegressor

from hyperopt import hp, tpe, fmin, anneal, Trials
import hpsklearn

In [3]:
#Read modeling data
model_df2 = pd.read_csv("model_data2.csv")
pd.set_option('display.max_columns', 500)

In [6]:
#Target Column
target_column = 'FUTURE MHV'

In [7]:
#Train-Test Split
X = model_df2.loc[:,"CRIMINAL DAMAGE ADJ":"MHV"]
y = model_df2[[target_column]]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=18)

The only changes from the Current_Models notebook are that we are predicting "FUTURE MHV" and including the current "MHV" as a predictor. For these models we will not use the adjusted MHV, because that value takes into account the average of all other community areas in Chicago. For real-world use, you would not know what the average housing value of all other community areas would be in the future.
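
For orientation, a next-month target such as "FUTURE MHV" can be derived by shifting each area's price series back by one month. The sketch below is only an assumption about how such a column might be built, not the preprocessing actually used for model_data2.csv; the "AREA" and "MONTH" column names and all values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: two community areas, three months each.
toy = pd.DataFrame({
    "AREA":  ["A", "A", "A", "B", "B", "B"],
    "MONTH": [1, 2, 3, 1, 2, 3],
    "MHV":   [100.0, 102.0, 101.0, 200.0, 205.0, 210.0],
})

# Sort so that shifting within each area moves along time.
toy = toy.sort_values(["AREA", "MONTH"]).reset_index(drop=True)

# Next month's value for the same area; the last month of each area
# has no future value and becomes NaN.
toy["FUTURE MHV"] = toy.groupby("AREA")["MHV"].shift(-1)

# Rows without a known future value cannot be used for training.
toy_model = toy.dropna(subset=["FUTURE MHV"])
```

Grouping by area before shifting keeps the last month of one area from being paired with the first month of the next.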

In [8]:
#Scale Data
scaler = StandardScaler()
scaled_X_train = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns)
scaled_X_test = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns)
scaled_X = pd.DataFrame(scaler.transform(X), columns=X.columns)

We will first establish a baseline: using the current month's housing value as the prediction for the next month's.

In [61]:
#Baseline RMSE and R-squared
r2 = r2_score(model_df2["FUTURE MHV"], model_df2["MHV"])
print("R^2: {}".format(r2))
rmse = np.sqrt(mean_squared_error(model_df2["FUTURE MHV"], model_df2["MHV"]))
print("Root Mean Squared Error: {}".format(rmse))

R^2: 0.9995375029474216
Root Mean Squared Error: 1.9473158170946396
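
The baseline above is scored on every row of model_df2, while the models later in this section are scored only on the test split. A minimal sketch of scoring the same "next month equals this month" baseline on any subset of rows follows. The function name persistence_baseline and the toy numbers are illustrative assumptions; in the notebook the inputs would be X_test["MHV"] and y_test["FUTURE MHV"].

```python
import numpy as np

def persistence_baseline(current, future):
    """Score the 'next month equals this month' prediction."""
    current = np.asarray(current, dtype=float)
    future = np.asarray(future, dtype=float)
    errors = future - current

    rmse = np.sqrt(np.mean(errors ** 2))
    # R^2 = 1 - SS_res / SS_tot, the usual coefficient of determination.
    ss_res = np.sum(errors ** 2)
    ss_tot = np.sum((future - future.mean()) ** 2)
    return rmse, 1.0 - ss_res / ss_tot

# Hypothetical values standing in for the held-out rows.
current = [100.0, 150.0, 200.0, 250.0]
future = [101.0, 149.0, 202.0, 248.0]
rmse, r2 = persistence_baseline(current, future)
```

Scoring the baseline and the models on the same rows makes the RMSE and R^2 figures directly comparable.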


This provides a good baseline for RMSE and R^2. However, from a business perspective we are perhaps more interested in whether the price is going to go up or down. Using only the previous month's price provides no predictive power for the direction. Therefore, I will write functions below to measure how well the models predict the direction of the next month's price.
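
The directionality measure can be stated compactly: a prediction is correct when it falls on the same side of the current price as the real future price. The sketch below is a vectorized illustration of that idea on toy numbers, not the notebook's own implementation; directional_accuracy and its arguments are hypothetical names.

```python
import numpy as np

def directional_accuracy(current, predicted, actual, min_dif=0.0):
    """Percent of rows where the predicted direction matches the real one.

    A row counts as "up" when the value is strictly above the current price.
    Only rows whose predicted move is larger than min_dif (in absolute
    size) are scored.
    """
    current = np.asarray(current, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)

    pred_up = predicted > current
    real_up = actual > current
    scored = np.abs(predicted - current) > min_dif
    return float((pred_up == real_up)[scored].mean() * 100)

# Toy example: four rows that all start from a current price of 100.
current = [100.0, 100.0, 100.0, 100.0]
predicted = [102.0, 99.0, 100.5, 97.0]
actual = [101.0, 101.0, 99.0, 98.0]

all_rows = directional_accuracy(current, predicted, actual)
big_moves = directional_accuracy(current, predicted, actual, min_dif=1.0)
```

Raising min_dif restricts the score to rows where the model predicts a larger move, which is one way to check whether confident predictions are more reliable.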

In [54]:
def results_df(y_pred, y_test=y_test, data=model_df2):
    #Compare predictions with the current and the actual future price, row by row
    results = y_test.copy()
    results["Pred"] = y_pred
    results["MHV"] = data.loc[y_test.index, "MHV"]
    #1 if the (predicted or real) future price is above the current price, else 0
    results["pred hi low"] = (results["Pred"] > results["MHV"]).astype(int)
    results["real hi low"] = (results["FUTURE MHV"] > results["MHV"]).astype(int)
    results["correct"] = (results["pred hi low"] == results["real hi low"]).astype(int)
    #Size of the real move, size of the predicted move, and absolute error
    results["real dif"] = (results["FUTURE MHV"] - results["MHV"]).abs()
    results["pred dif"] = (results["Pred"] - results["MHV"]).abs()
    results["ae"] = (results["Pred"] - results["FUTURE MHV"]).abs()

    return results

def percent_correct_directionality(results_df, min_dif=0):
    return (results_df.loc[results_df["pred dif"] > min_dif]["correct"].mean() * 100)

In [123]:
#Baseline directionality: 1 if the price went up, 0 otherwise
up_down = [1 if future > current else 0
           for current, future in zip(X_test["MHV"], y_test["FUTURE MHV"])]

sum(up_down) / len(up_down) * 100

54.53474676089517

A quick comparison of the future prices with the current prices shows that prices go up 54.5% of the time in the test set. A model that predicted an increase every time would therefore be 54.5% accurate. This provides a good baseline for a model's ability to predict the direction of price movement.
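
The reasoning behind this baseline is that a constant guess is exactly as accurate as the share of rows that move in the guessed direction. A small sketch on made-up prices (all values hypothetical):

```python
import numpy as np

# Hypothetical current and next-month prices for four rows.
current = np.array([100.0, 100.0, 100.0, 100.0])
future = np.array([101.0, 102.0, 99.0, 103.0])

went_up = future > current  # unchanged prices count as "not up"
always_up_accuracy = float(went_up.mean() * 100)
always_down_accuracy = 100.0 - always_up_accuracy

# A useful model has to beat the better of the two constant guesses.
floor = max(always_up_accuracy, always_down_accuracy)
```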

Use hyperopt to tune the parameters of the random forest. I expect these parameters may differ from before, since we are now including a highly predictive column (the last month's housing value).

In [9]:
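#Define function to minimize
def rf_mse_cv(params, random_state=42, cv=3, X=scaled_X_train, y=y_train):
    #hyperopt samples floats, so cast the integer-valued parameters
    params = {'n_estimators': int(params['n_estimators']), 
              'max_features': int(params['max_features'])} 
    #Squared error is the default criterion ("mse" was renamed "squared_error" in newer scikit-learn)
    model = RandomForestRegressor(random_state=random_state, **params)
    #cross_val_score returns negated MSE, so flip the sign to get a loss to minimize
    score = -cross_val_score(model, X, y, cv=cv, scoring="neg_mean_squared_error", n_jobs=-1).mean()

    return score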

In [10]:
%%time

space={'n_estimators': hp.uniform('n_estimators', 10, 1500),
       'max_features' : hp.uniform('max_features', 2, 36)
      }

trials = Trials()

best=fmin(fn=rf_mse_cv,
          space=space, 
          algo=tpe.suggest,
          max_evals=10,
          trials=trials,
          rstate=np.random.RandomState(42)
         )

100%|██████████| 10/10 [06:57<00:00, 47.33s/it, best loss: 3.3381939722723346]
CPU times: user 152 ms, sys: 199 ms, total: 351 ms
Wall time: 6min 58s


In [11]:
best

{'max_features': 25.82112911712644, 'n_estimators': 1433.6621996017927}
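
The values returned by fmin are the raw floats drawn from hp.uniform, whereas rf_mse_cv truncates them with int() before fitting. The final model should therefore use the truncated values rather than rounded ones. A small sketch of that conversion, where to_int_params is a hypothetical helper name:

```python
# Raw result of the search above: hp.uniform samples floats.
best = {"max_features": 25.82112911712644, "n_estimators": 1433.6621996017927}

def to_int_params(raw):
    """Truncate every value the same way the objective function does."""
    return {name: int(value) for name, value in raw.items()}

rf_params = to_int_params(best)
```

For the gradient boosting search, learning_rate has to stay a float, so only n_estimators and max_features would be truncated. hyperopt also offers hp.quniform, which samples values rounded to a step size; those are still floats and need the same int() cast.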

In [47]:
#Create RF model using best parameters and test
#Squared error is the default criterion, so it is not passed explicitly
rf = RandomForestRegressor(random_state=42,
                           n_estimators=1433,
                           max_features=25)

rf_model = rf.fit(scaled_X_train,y_train.values.ravel())
y_pred = rf_model.predict(scaled_X_test)

print("R^2: {}".format(rf_model.score(scaled_X_test, y_test)))
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("Root Mean Squared Error: {}".format(rmse))
rf_df = results_df(y_pred=y_pred)
print("Directionality %: {}".format(percent_correct_directionality(rf_df)))

R^2: 0.9996352833925544
Root Mean Squared Error: 1.689352278356195
Directionality %: 69.84687868080094


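By all three measures, the random forest model performs better than the baseline: R^2 of 0.99964 versus 0.99954, RMSE of 1.69 versus 1.95, and directionality of 69.8% versus 54.5%.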

Using hyperopt for GB regression:

In [13]:
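#Define function to minimize
def gb_mse_cv(params, random_state=42, cv=3, X=scaled_X_train, y=y_train):
    #hyperopt samples floats, so cast the integer-valued parameters
    params = {'n_estimators': int(params['n_estimators']), 
              'max_features': int(params['max_features']), 
             'learning_rate': params['learning_rate']}
    #Squared error is the default loss ('ls' was renamed 'squared_error' in newer scikit-learn)
    model = GradientBoostingRegressor(random_state=random_state, **params)
    #cross_val_score returns negated MSE, so flip the sign to get a loss to minimize
    score = -cross_val_score(model, X, y, cv=cv, scoring="neg_mean_squared_error", n_jobs=-1).mean()

    return score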

In [14]:
%%time

space={'n_estimators': hp.uniform('n_estimators', 10, 10000),
       'max_features' : hp.uniform('max_features', 2, 36),
       'learning_rate': hp.uniform('learning_rate', 0.01, 0.99)
      }

trials = Trials()

best=fmin(fn=gb_mse_cv, 
          space=space, 
          algo=tpe.suggest,
          max_evals=10,
          trials=trials,
          rstate=np.random.RandomState(42)
         )

100%|██████████| 10/10 [09:20<00:00, 56.23s/it, best loss: 2.844469211607817]
CPU times: user 159 ms, sys: 242 ms, total: 401 ms
Wall time: 9min 20s


In [15]:
best

{'learning_rate': 0.11276197538164948,
 'max_features': 33.23656573675881,
 'n_estimators': 4453.876797888506}

In [48]:
#Create GB model using best parameters and test
gb = GradientBoostingRegressor(random_state=42, 
                               n_estimators=4453,
                               max_features=33,
                               learning_rate=0.11276197538164948)

gb_model = gb.fit(scaled_X_train, y_train.values.ravel())
y_pred = gb_model.predict(scaled_X_test)

print("R^2: {}".format(gb_model.score(scaled_X_test, y_test)))
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("Root Mean Squared Error: {}".format(rmse))
gb_df = results_df(y_pred=y_pred)
print("Directionality %: {}".format(percent_correct_directionality(gb_df)))

R^2: 0.9997370381029932
Root Mean Squared Error: 1.4344613930280252
Directionality %: 75.97173144876325


In [105]:
#KNN Regression
knn = KNeighborsRegressor(n_neighbors = 1)

knn_model = knn.fit(scaled_X_train, y_train.values.ravel())
y_pred = knn_model.predict(scaled_X_test)

print("R^2: {}".format(knn_model.score(scaled_X_test, y_test)))
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("Root Mean Squared Error: {}".format(rmse))
knn_df = results_df(y_pred=y_pred)
#min_dif=-1 also scores rows whose prediction equals the current MHV (pred dif of 0)
print("Directionality %: {}".format(percent_correct_directionality(knn_df, min_dif=-1)))

R^2: 0.9979442257170397
Root Mean Squared Error: 4.0107922452317135
Directionality %: 66.70592854338437


In [50]:
#Linear Regression
lr = LinearRegression()
lr_model = lr.fit(scaled_X_train, y_train.values.ravel())
y_pred = lr_model.predict(scaled_X_test)

print("R^2: {}".format(lr_model.score(scaled_X_test, y_test)))
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("Root Mean Squared Error: {}".format(rmse))
lr_df = results_df(y_pred=y_pred)
print("Directionality %: {}".format(percent_correct_directionality(lr_df)))

R^2: 0.999600253445323
Root Mean Squared Error: 1.7686211605906303
Directionality %: 61.75893207695328


In [51]:
#Voting Regressor
vr = VotingRegressor(estimators=[('rf', rf), ('knn', knn), ('gb', gb), ('lr', lr)])
vr_model = vr.fit(scaled_X_train, y_train.values.ravel())
y_pred = vr_model.predict(scaled_X_test)

print("R^2: {}".format(vr_model.score(scaled_X_test, y_test)))
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("Root Mean Squared Error: {}".format(rmse))
vr_df = results_df(y_pred=y_pred)
print("Directionality %: {}".format(percent_correct_directionality(vr_df)))

R^2: 0.9997059709212114
Root Mean Squared Error: 1.516832389521937
Directionality %: 78.955634079309


In [53]:
#Voting Regressor (w/o KNN)
vr2 = VotingRegressor(estimators=[('rf', rf), ('gb', gb), ('lr', lr)])
vr2_model = vr2.fit(scaled_X_train, y_train.values.ravel())
y_pred = vr2_model.predict(scaled_X_test)

print("R^2: {}".format(vr2_model.score(scaled_X_test, y_test)))
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("Root Mean Squared Error: {}".format(rmse))
vr2_df = results_df(y_pred=y_pred)
print("Directionality %: {}".format(percent_correct_directionality(vr2_df)))

R^2: 0.9997268918677652
Root Mean Squared Error: 1.4618734147538048
Directionality %: 76.05025520219867


The voting regressor with all four models obtained the highest directionality (79.0%), whereas the GB model had the lowest RMSE (1.43).
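
To make the comparison easier to check, the test-set scores printed above can be collected in one place. The sketch below only copies the reported numbers into a dictionary and ranks them; the model labels are my own shorthand.

```python
# Test-set scores copied from the outputs above: (RMSE, directionality %).
results = {
    "Random forest":        (1.689352278356195, 69.84687868080094),
    "Gradient boosting":    (1.4344613930280252, 75.97173144876325),
    "KNN":                  (4.0107922452317135, 66.70592854338437),
    "Linear regression":    (1.7686211605906303, 61.75893207695328),
    "Voting (all four)":    (1.516832389521937, 78.955634079309),
    "Voting (without KNN)": (1.4618734147538048, 76.05025520219867),
}

# Lowest RMSE and highest directionality across all models.
best_rmse = min(results, key=lambda name: results[name][0])
best_direction = max(results, key=lambda name: results[name][1])

# Print the models from lowest to highest RMSE.
for name, (rmse, direction) in sorted(results.items(), key=lambda kv: kv[1][0]):
    print(f"{name:<22} RMSE {rmse:6.3f}   direction {direction:5.1f}%")
```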