### Modeling: Predicting Median Rent from Yelp Metrics

In this section we explore regression models that could predict an area's median rent from the proportion of nearby Yelp businesses in each price tier. The models explored are Linear Regression, LASSO, Ridge, Decision Tree Regression, K Nearest Neighbors Regression, AdaBoost Regression, and Bagged Decision Tree Regression. Each model is assessed with $R^2$ and RMSE.
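
As a reminder of what the two metrics measure, here is a minimal sketch that computes them by hand. The helper names `rmse` and `r2` and the rent values are made up for illustration; the notebook itself relies on `mean_squared_error` and `r2_score` from scikit-learn.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: typical size of a prediction error, in rent units."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2(y_true, y_pred):
    """R^2: share of the variance in y_true explained by the predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return float(1.0 - ss_res / ss_tot)

# Made-up rents and predictions, for illustration only.
rents = [1000.0, 1200.0, 1400.0, 1600.0]
preds = [1100.0, 1100.0, 1500.0, 1500.0]

print(rmse(rents, preds))  # every prediction is off by 100
print(r2(rents, preds))
```

Predicting the mean rent for every area gives an $R^2$ of 0, so a negative test $R^2$ means a model does worse than that constant baseline.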

#### Imports 

In [1]:
import pandas as pd 
import numpy as np 

from sklearn.linear_model import LinearRegression, LassoCV, RidgeCV
from sklearn.ensemble import AdaBoostRegressor, BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor

from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV

In [2]:
df = pd.read_csv("../datasets/generated_gps_price_radius.csv")

df.drop(columns = ["neighborhood", "median income", "median home value", "index"], inplace = True)

In [3]:
X_one_mi = df[["1.0mi 1 dollar", "1.0mi 2 dollar", "1.0mi 3 dollar", "1.0mi 4 dollar"]]

X_half_mi = df[["0.5mi 1 dollar", "0.5mi 2 dollar", "0.5mi 3 dollar", "0.5mi 4 dollar"]]

X_all = df[["0.5mi 1 dollar", "0.5mi 2 dollar", "0.5mi 3 dollar", "0.5mi 4 dollar", 
            "1.0mi 1 dollar", "1.0mi 2 dollar", "1.0mi 3 dollar", "1.0mi 4 dollar"]]

y = df["median rent"]


In [4]:
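def ultrafit(X_train, X_test, y_train, y_test, model, grid = False, params = None): 
    """Fit `model`, print train/test RMSE and R2, and return the fitted model with those metrics.

    If `grid` is True, `model` is tuned with a 5-fold GridSearchCV over `params`
    and the best estimator is the one that gets scored.
    """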
    
    if grid: 
        print("Gridsearching...")
        
        griddle = GridSearchCV(model, 
                               param_grid = params, 
                               cv = 5)
        griddle.fit(X_train, y_train)
        print("Model has been fit.")
        
        mod = griddle.best_estimator_ 
        
        print(f"Best Parameters: \n{griddle.best_params_}")
        
    else: 
        mod = model.fit(X_train, y_train)
        print("Model has been fit.")

    y_train_preds = mod.predict(X_train)
    y_test_preds = mod.predict(X_test)

    RMSE_train = round(mean_squared_error(y_train, y_train_preds)**0.5, 2)
    RMSE_test = round(mean_squared_error(y_test, y_test_preds)**0.5, 2)

    r2_train = r2_score(y_train, y_train_preds)
    r2_test = r2_score(y_test, y_test_preds)
    
    print("----")
    print("Metrics:")
    print(f"Train RMSE = {RMSE_train} \nTest RMSE = {RMSE_test}")
    print("----")
    print(f"Train R2 score = {r2_train} \nTest R2 score = {r2_test}")
    
    return mod, RMSE_train, RMSE_test, r2_train, r2_test
    

#### Train/Test Split 

In [5]:
X_one_train, X_one_test, y_one_train, y_one_test = train_test_split(X_one_mi, y, 
                                                                    test_size = 0.3, 
                                                                    random_state = 42)

X_half_train, X_half_test, y_half_train, y_half_test = train_test_split(X_half_mi, y, 
                                                                        test_size = 0.3, 
                                                                        random_state = 42)

X_all_train, X_all_test, y_all_train, y_all_test = train_test_split(X_all, y, 
                                                                    test_size = 0.3, 
                                                                    random_state = 42)

X_trains = [X_one_train, X_half_train, X_all_train]
X_tests = [X_one_test, X_half_test, X_all_test]

y_trains = [y_one_train, y_half_train, y_all_train]
y_tests = [y_one_test, y_half_test, y_all_test]

names = ["One Mile Radius (exclusive)", "Half Mile Radius", "One Mile Radius (inclusive)"]

In [6]:
rmse_train = []
rmse_test = []
r2_train = []
r2_test = []
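
The four lists above are filled by hand after every fit, so a mistyped variable name can silently append a stale value. One possible alternative is to return each model's metrics as a dict and build the table in one pass. The sketch below assumes synthetic data and a hypothetical `score_model` helper; neither is part of this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the Yelp price-tier features; NOT the notebook's data.
rng = np.random.default_rng(42)
X = pd.DataFrame(rng.random((200, 4)), columns=["f1", "f2", "f3", "f4"])
y = 1000 + 300 * X["f1"] - 200 * X["f4"] + rng.normal(0, 50, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

def score_model(name, model):
    """Hypothetical helper: fit one model and return its metrics as a dict."""
    model.fit(X_train, y_train)
    train_preds = model.predict(X_train)
    test_preds = model.predict(X_test)
    return {
        "Model": name,
        "RMSE Train": mean_squared_error(y_train, train_preds) ** 0.5,
        "RMSE Test": mean_squared_error(y_test, test_preds) ** 0.5,
        "R2 Train": r2_score(y_train, train_preds),
        "R2 Test": r2_score(y_test, test_preds),
    }

# Each entry pairs a label with an unfitted estimator.
models = {
    "lr": LinearRegression(),
    "dtree": DecisionTreeRegressor(max_depth=3, random_state=42),
}

# One row per model; the label and its metrics cannot drift apart.
results = pd.DataFrame([score_model(name, model) for name, model in models.items()])
print(results)
```

Because every row is produced by the same function call, a model's name and its metrics always stay together.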

#### Model: Linear Regression

In [7]:
lr_one, rmse_tr, rmse_te, r2_tr, r2_te = ultrafit(X_trains[0], X_tests[0], y_trains[0], 
                                                  y_tests[0], LinearRegression())

rmse_train.append(rmse_tr)
rmse_test.append(rmse_te)
r2_train.append(r2_tr)
r2_test.append(r2_te)

Model has been fit.
----
Metrics:
Train RMSE = 221.7 
Test RMSE = 220.03
----
Train R2 score = 0.12547868466311107 
Test R2 score = 0.0909107925942293


In [8]:
lr_half, rmse_tr, rmse_te, r2_tr, r2_te = ultrafit(X_trains[1], X_tests[1], y_trains[1], y_tests[1], 
                                                   LinearRegression())

rmse_train.append(rmse_tr)
rmse_test.append(rmse_te)
r2_train.append(r2_tr)
r2_test.append(r2_te)

Model has been fit.
----
Metrics:
Train RMSE = 234.61 
Test RMSE = 229.2
----
Train R2 score = 0.020670344391434448 
Test R2 score = 0.013606058830050394


In [9]:
lr_all, rmse_tr, rmse_te, r2_tr, r2_te = ultrafit(X_trains[2], X_tests[2], y_trains[2], y_tests[2], 
                                                  LinearRegression())

rmse_train.append(rmse_tr)
rmse_test.append(rmse_te)
r2_train.append(r2_tr)
r2_test.append(r2_te)

Model has been fit.
----
Metrics:
Train RMSE = 220.27 
Test RMSE = 219.92
----
Train R2 score = 0.1366994875364993 
Test R2 score = 0.09181308023560053


#### Model: LASSO 

In [10]:
lasso_one, rmse_tr, rmse_te, r2_tr, r2_te = ultrafit(X_trains[0], X_tests[0], y_trains[0], y_tests[0], 
                                                     LassoCV(cv = 5))

rmse_train.append(rmse_tr)
rmse_test.append(rmse_te)
r2_train.append(r2_tr)
r2_test.append(r2_te)

Model has been fit.
----
Metrics:
Train RMSE = 221.7 
Test RMSE = 220.03
----
Train R2 score = 0.12547482525097142 
Test R2 score = 0.09097211832084506


In [11]:
lasso_half, rmse_tr, rmse_te, r2_tr, r2_te = ultrafit(X_trains[1], X_tests[1], y_trains[1], y_tests[1], 
                                                      LassoCV(cv = 5))

rmse_train.append(rmse_tr)
rmse_test.append(rmse_te)
r2_train.append(r2_tr)
r2_test.append(r2_te)

Model has been fit.
----
Metrics:
Train RMSE = 234.74 
Test RMSE = 229.28
----
Train R2 score = 0.019538226314473883 
Test R2 score = 0.012913520046618387


In [12]:
lasso_all, rmse_tr, rmse_te, r2_tr, r2_te = ultrafit(X_trains[2], X_tests[2], y_trains[2], y_tests[2], 
                                                     LassoCV(cv = 5))

rmse_train.append(rmse_tr)
rmse_test.append(rmse_te)
r2_train.append(r2_tr)
r2_test.append(r2_te)

Model has been fit.
----
Metrics:
Train RMSE = 220.27 
Test RMSE = 219.9
----
Train R2 score = 0.13669130245664152 
Test R2 score = 0.09202069068388985


#### Model: Ridge

In [13]:
ridge_one, rmse_tr, rmse_te, r2_tr, r2_te = ultrafit(X_trains[0], X_tests[0], y_trains[0], y_tests[0], 
                                                     RidgeCV(cv = 5))

rmse_train.append(rmse_tr)
rmse_test.append(rmse_te)
r2_train.append(r2_tr)
r2_test.append(r2_te)

Model has been fit.
----
Metrics:
Train RMSE = 223.53 
Test RMSE = 221.63
----
Train R2 score = 0.11093510190921807 
Test R2 score = 0.07764980318798431


In [14]:
ridge_half, rmse_tr, rmse_te, r2_tr, r2_te = ultrafit(X_trains[1], X_tests[1], y_trains[1], y_tests[1], 
                                                      RidgeCV(cv = 5))

rmse_train.append(rmse_tr)
rmse_test.append(rmse_te)
r2_train.append(r2_tr)
r2_test.append(r2_te)

Model has been fit.
----
Metrics:
Train RMSE = 234.67 
Test RMSE = 229.29
----
Train R2 score = 0.020162154991251402 
Test R2 score = 0.012778569977973775


In [15]:
ridge_all, rmse_tr, rmse_te, r2_tr, r2_te = ultrafit(X_trains[2], X_tests[2], y_trains[2], y_tests[2], 
                                                     RidgeCV(cv = 5))

rmse_train.append(rmse_tr)
rmse_test.append(rmse_te)
r2_train.append(r2_tr)
r2_test.append(r2_te)

Model has been fit.
----
Metrics:
Train RMSE = 222.24 
Test RMSE = 221.36
----
Train R2 score = 0.12118653814510805 
Test R2 score = 0.07992246676737869




#### Model: Decision Tree Regressor

In [16]:
dtree_params = {
    "max_depth": [3, 5, 10, 12, None]
}

dtree_one, rmse_tr, rmse_te, r2_tr, r2_te = ultrafit(X_trains[0], X_tests[0], y_trains[0], y_tests[0], 
                                                     DecisionTreeRegressor(), grid = True, params = dtree_params)

rmse_train.append(rmse_tr)
rmse_test.append(rmse_te)
r2_train.append(r2_tr)
r2_test.append(r2_te)

Gridsearching...
Model has been fit.
Best Parameters: 
{'max_depth': 3}
----
Metrics:
Train RMSE = 201.17 
Test RMSE = 208.34
----
Train R2 score = 0.2798969614151099 
Test R2 score = 0.18496965723263947




In [17]:
dtree_half, rmse_tr, rmse_te, r2_tr, r2_te = ultrafit(X_trains[1], X_tests[1], y_trains[1], y_tests[1], 
                                                      DecisionTreeRegressor(), grid = True, params = dtree_params)

rmse_train.append(rmse_tr)
rmse_test.append(rmse_te)
r2_train.append(r2_tr)
r2_test.append(r2_te)

Gridsearching...
Model has been fit.
Best Parameters: 
{'max_depth': 3}
----
Metrics:
Train RMSE = 223.54 
Test RMSE = 229.01
----
Train R2 score = 0.11085408589213352 
Test R2 score = 0.015217762993154027


In [18]:
dtree_all, rmse_tr, rmse_te, r2_tr, r2_te = ultrafit(X_trains[2], X_tests[2], y_trains[2], y_tests[2], 
                                                     DecisionTreeRegressor(), grid = True, params = dtree_params)

rmse_train.append(rmse_tr)
rmse_test.append(rmse_te)
r2_train.append(r2_tr)
r2_test.append(r2_te)

Gridsearching...
Model has been fit.
Best Parameters: 
{'max_depth': 3}
----
Metrics:
Train RMSE = 200.53 
Test RMSE = 210.5
----
Train R2 score = 0.2845255331635723 
Test R2 score = 0.16797169547259472




#### Model: K Nearest-Neighbor Regressor

In [19]:
knn_params = {
    "n_neighbors": [3, 5, 7, 10, 12], 
    "weights": ["uniform", "distance"]
}

knn_one, rmse_tr, rmse_te, r2_tr, r2_te = ultrafit(X_trains[0], X_tests[0], y_trains[0], y_tests[0], 
                                                   KNeighborsRegressor(), grid = True, params = knn_params)

rmse_train.append(rmse_tr)
rmse_test.append(rmse_te)
r2_train.append(r2_tr)
r2_test.append(r2_te)

Gridsearching...
Model has been fit.
Best Parameters: 
{'n_neighbors': 12, 'weights': 'uniform'}
----
Metrics:
Train RMSE = 195.8 
Test RMSE = 199.14
----
Train R2 score = 0.317838251118267 
Test R2 score = 0.255357800917727




In [20]:
knn_half, rmse_tr, rmse_te, r2_tr, r2_te = ultrafit(X_trains[1], X_tests[1], y_trains[1], y_tests[1], 
                                                    KNeighborsRegressor(), grid = True, params = knn_params)

rmse_train.append(rmse_tr)
rmse_test.append(rmse_te)
r2_train.append(r2_tr)
r2_test.append(r2_te)

Gridsearching...
Model has been fit.
Best Parameters: 
{'n_neighbors': 12, 'weights': 'uniform'}
----
Metrics:
Train RMSE = 230.43 
Test RMSE = 236.08
----
Train R2 score = 0.055227139737464515 
Test R2 score = -0.046505825487675345


In [21]:
knn_all, rmse_tr, rmse_te, r2_tr, r2_te = ultrafit(X_trains[2], X_tests[2], y_trains[2], y_tests[2], 
                                                   KNeighborsRegressor(), grid = True, params = knn_params)

rmse_train.append(rmse_tr)
rmse_test.append(rmse_te)
r2_train.append(r2_tr)
r2_test.append(r2_te)

Gridsearching...
Model has been fit.
Best Parameters: 
{'n_neighbors': 12, 'weights': 'uniform'}
----
Metrics:
Train RMSE = 200.07 
Test RMSE = 216.5
----
Train R2 score = 0.2877627191240799 
Test R2 score = 0.1198926730614932




#### Model: AdaBoost Regressor

In [22]:
ada_params = {
    "base_estimator": [DecisionTreeRegressor(max_depth = 3), DecisionTreeRegressor(max_depth = 5), 
                       DecisionTreeRegressor(max_depth = 7), DecisionTreeRegressor(max_depth = 10), 
                       DecisionTreeRegressor(max_depth = None)], 
    "n_estimators": [20, 30, 50, 60, 70]
}

ada_one, rmse_tr, rmse_te, r2_tr, r2_te = ultrafit(X_trains[0], X_tests[0], y_trains[0], y_tests[0], 
                                                   AdaBoostRegressor(), grid = True, params = ada_params)

rmse_train.append(rmse_tr)
rmse_test.append(rmse_te)
r2_train.append(r2_tr)
r2_test.append(r2_te)

Gridsearching...
Model has been fit.
Best Parameters: 
{'base_estimator': DecisionTreeRegressor(criterion='mse', max_depth=3, max_features=None,
                      max_leaf_nodes=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      presort=False, random_state=None, splitter='best'), 'n_estimators': 70}
----
Metrics:
Train RMSE = 200.66 
Test RMSE = 205.97
----
Train R2 score = 0.2835700411503941 
Test R2 score = 0.20344554617282762




In [23]:
ada_half, rmse_tr, rmse_te, r2_tr, r2_te = ultrafit(X_trains[1], X_tests[1], y_trains[1], y_tests[1], 
                                                    AdaBoostRegressor(), grid = True, params = ada_params)

rmse_train.append(rmse_tr)
rmse_test.append(rmse_te)
r2_train.append(r2_tr)
r2_test.append(r2_te)

Gridsearching...
Model has been fit.
Best Parameters: 
{'base_estimator': DecisionTreeRegressor(criterion='mse', max_depth=3, max_features=None,
                      max_leaf_nodes=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      presort=False, random_state=None, splitter='best'), 'n_estimators': 70}
----
Metrics:
Train RMSE = 226.46 
Test RMSE = 225.13
----
Train R2 score = 0.08745778869146226 
Test R2 score = 0.04828851502540954




In [24]:
ada_all, rmse_tr, rmse_te, r2_tr, r2_te = ultrafit(X_trains[2], X_tests[2], y_trains[2], y_tests[2], 
                                                   AdaBoostRegressor(), grid = True, params = ada_params)

rmse_train.append(rmse_tr)
rmse_test.append(rmse_te)
r2_train.append(r2_tr)
r2_test.append(r2_te)

Gridsearching...
Model has been fit.
Best Parameters: 
{'base_estimator': DecisionTreeRegressor(criterion='mse', max_depth=3, max_features=None,
                      max_leaf_nodes=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      presort=False, random_state=None, splitter='best'), 'n_estimators': 20}
----
Metrics:
Train RMSE = 197.49 
Test RMSE = 207.73
----
Train R2 score = 0.3060284023344334 
Test R2 score = 0.18974405921184756




#### Model: Bagged Decision Tree Regressor

In [25]:
bag_params = {
    "n_estimators": [5, 10, 15, 20, 25], 
    "max_features": [0.3, 0.5, 0.7, 1.0], 
    "bootstrap_features": [True, False]
}

bag_one, rmse_tr, rmse_te, r2_tr, r2_te = ultrafit(X_trains[0], X_tests[0], y_trains[0], y_tests[0], 
                                                   BaggingRegressor(), grid = True, params = bag_params)

rmse_train.append(rmse_tr)
rmse_test.append(rmse_te)
r2_train.append(r2_tr)
r2_test.append(r2_te)

Gridsearching...
Model has been fit.
Best Parameters: 
{'bootstrap_features': False, 'max_features': 0.7, 'n_estimators': 25}
----
Metrics:
Train RMSE = 169.99 
Test RMSE = 192.95
----
Train R2 score = 0.485864059564181 
Test R2 score = 0.3009408254411252




In [26]:
bag_half, rmse_tr, rmse_te, r2_tr, r2_te = ultrafit(X_trains[1], X_tests[1], y_trains[1], y_tests[1], 
                                                    BaggingRegressor(), grid = True, params = bag_params)

rmse_train.append(rmse_tr)
rmse_test.append(rmse_te)
r2_train.append(r2_tr)
r2_test.append(r2_te)

Gridsearching...
Model has been fit.
Best Parameters: 
{'bootstrap_features': False, 'max_features': 0.7, 'n_estimators': 25}
----
Metrics:
Train RMSE = 211.26 
Test RMSE = 226.55
----
Train R2 score = 0.20586007975639553 
Test R2 score = 0.03630277123777159


In [27]:
bag_all, rmse_tr, rmse_te, r2_tr, r2_te = ultrafit(X_trains[2], X_tests[2], y_trains[2], y_tests[2], 
                                                   BaggingRegressor(), grid = True, params = bag_params)

rmse_train.append(rmse_tr)
rmse_test.append(rmse_te)
r2_train.append(r2_tr)
r2_test.append(r2_te)

Gridsearching...
Model has been fit.
Best Parameters: 
{'bootstrap_features': True, 'max_features': 1.0, 'n_estimators': 15}
----
Metrics:
Train RMSE = 150.71 
Test RMSE = 195.61
----
Train R2 score = 0.59583645745188 
Test R2 score = 0.2815092713296621




#### Observations

In [33]:
metrics = {
    "Model": ["lr_one", "lr_half", "lr_all", "lasso_one", "lasso_half", "lasso_all", "ridge_one", "ridge_half", 
              "ridge_all", "dtree_one", "dtree_half", "dtree_all", "knn_one", "knn_half", "knn_all", "ada_one", 
              "ada_half", "ada_all", "bag_one", "bag_half", "bag_all"], 
    "RMSE Train": rmse_train, 
    "RMSE Test": rmse_test, 
    "R2 Train": r2_train, 
    "R2 Test": r2_test
}

metrics_df = pd.DataFrame(metrics)

metrics_df["RMSE Difference"] = metrics_df["RMSE Train"] - metrics_df["RMSE Test"]

metrics_df.sort_values(by = ["RMSE Train", "RMSE Difference", "R2 Train"], ascending = True)

| | Model | RMSE Train | RMSE Test | R2 Train | R2 Test | RMSE Difference |
|---|---|---|---|---|---|---|
| 20 | bag_all | 150.71 | 195.61 | 0.595836 | 0.281509 | -44.90 |
| 18 | bag_one | 169.99 | 192.95 | 0.485864 | 0.300941 | -22.96 |
| 12 | knn_one | 195.80 | 199.14 | 0.317838 | 0.255358 | -3.34 |
| 17 | ada_all | 197.49 | 207.73 | 0.306028 | 0.189744 | -10.24 |
| 14 | knn_all | 200.07 | 216.50 | 0.287763 | 0.119893 | -16.43 |
| 11 | dtree_all | 200.53 | 210.50 | 0.284526 | 0.167972 | -9.97 |
| 15 | ada_one | 200.66 | 205.97 | 0.283570 | 0.203446 | -5.31 |
| 9 | dtree_one | 201.17 | 208.34 | 0.279897 | 0.184970 | -7.17 |
| 19 | bag_half | 211.26 | 226.55 | 0.205860 | 0.036303 | -15.29 |
| 2 | lr_all | 220.27 | 219.92 | 0.136699 | 0.091813 | 0.35 |


Considering the performance metrics displayed above, the "best" (or least bad) model for predicting median rent from Yelp data appears to be a **K Nearest Neighbors Regressor** (`knn_one`). The best-performing version used only Yelp data for businesses between 0.5 mi and 1.0 mi from a given location. Two bagged decision tree models (`bag_all` and `bag_one`) had lower RMSEs, but they also had the largest gaps between train and test RMSE in the table (44.90 and 22.96), which suggests overfitting. The K Nearest Neighbors Regressor had the third-lowest training RMSE (195.80), a gap between train and test RMSE of less than 5 (3.34), and a test $R^2$ (0.255) higher than that of every model other than the bagged trees.
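
The selection reasoning above can be written as an explicit rule: keep only models whose train/test RMSE gap is small, then take the one with the highest test $R^2$. The sketch below uses a subset of the metrics printed in the per-model cells; the threshold `MAX_GAP` and the column name `RMSE Gap` are assumptions chosen for illustration, not part of the original analysis.

```python
import pandas as pd

# Metrics copied from the per-model cell outputs (a subset of the fitted models).
metrics_df = pd.DataFrame({
    "Model": ["bag_all", "bag_one", "knn_one", "ada_one", "dtree_one", "lr_all"],
    "RMSE Train": [150.71, 169.99, 195.80, 200.66, 201.17, 220.27],
    "RMSE Test": [195.61, 192.95, 199.14, 205.97, 208.34, 219.92],
    "R2 Test": [0.2815, 0.3009, 0.2554, 0.2034, 0.1850, 0.0918],
})

MAX_GAP = 5.0  # hypothetical cutoff for an acceptable train/test RMSE gap

# Absolute gap, so the sign of (train - test) does not matter.
metrics_df["RMSE Gap"] = (metrics_df["RMSE Train"] - metrics_df["RMSE Test"]).abs()

# Keep models that generalize consistently, then rank them by test R^2.
stable = metrics_df[metrics_df["RMSE Gap"] < MAX_GAP]
best = stable.sort_values("R2 Test", ascending=False).iloc[0]

print(best["Model"])
```

A looser cutoff would admit the bagged models, which score better on the test set but fit the training data far more closely, so the choice of threshold is a judgment call.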