### Modeling: Predicting Median Rent from Yelp Metrics

In this section we explore regression models that predict an area's median gross rent from the counts and proportions of nearby Yelp businesses in each price tier. The approach mirrors the other median rent modeling, but the dataset here was built by sourcing the Yelp businesses surrounding each location rather than by pulling businesses from a pre-existing dataset. Only a one-mile radius is used, because one-mile radii performed better overall in the previous modeling. Restaurants in the highest price tier (four dollar signs) were excluded because they are rare. The models explored are Linear Regression, LASSO, Ridge, Decision Tree Regression, K Nearest Neighbors Regression, AdaBoost Regression, and Bagged Decision Tree Regression. The assessment metrics are $R^2$ and RMSE.
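
For reference, both metrics can be computed directly from a set of predictions. The following is a minimal sketch in plain Python that uses made-up rent values rather than the project's data. The helper names `rmse` and `r_squared` are illustrative stand-ins for the scikit-learn functions `mean_squared_error` and `r2_score` that the notebook actually calls.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: typical size of a prediction error, in the target's units."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Hypothetical rents (in dollars) and predictions, for illustration only.
y_true = [1000, 1200, 1400, 1600]
y_pred = [1100, 1100, 1500, 1500]

print(rmse(y_true, y_pred))       # every error is 100, so the RMSE is 100.0
print(r_squared(y_true, y_pred))
```

RMSE is expressed in the same units as the target (here, dollars of monthly rent), while $R^2$ is unitless and compares the model against always predicting the mean.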

#### Imports 

In [1]:
import pandas as pd 
import numpy as np 

from sklearn.linear_model import LinearRegression, LassoCV, RidgeCV
from sklearn.ensemble import AdaBoostRegressor, BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor

from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV

In [2]:
df = pd.read_csv("../Richard_392_dem.csv")

df = df.loc[:, ["1", "2", "3", "1p", "2p", "3p", "MEDIAN_GROSS_RENT"]]

In [3]:
df.head()

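| | 1 | 2 | 3 | 1p | 2p | 3p | MEDIAN_GROSS_RENT |
|---|---|---|---|---|---|---|---|
| 0 | 30 | 24 | 0 | 0.555556 | 0.444444 | 0.0 | 988 |
| 1 | 18 | 18 | 0 | 0.5 | 0.5 | 0.0 | 1490 |
| 2 | 38 | 110 | 4 | 0.245161 | 0.709677 | 0.025806 | 1576 |
| 3 | 6 | 9 | 0 | 0.4 | 0.6 | 0.0 | 1596 |
| 4 | 23 | 60 | 1 | 0.267442 | 0.697674 | 0.011628 | 1576 |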
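
The `1p`, `2p`, and `3p` columns look like each tier's count divided by the total number of businesses found. Judging from the rows shown, that total seems to include the excluded fourth tier: row 2 gives 38/155 ≈ 0.245161, whereas the three listed counts sum to only 152. That reading is an inference from the displayed values, not something the source states, and the helper `tier_proportions` below is hypothetical.

```python
def tier_proportions(counts, total):
    """Return each tier's share of all businesses, rounded to six decimals."""
    if total == 0:
        return [0.0 for _ in counts]  # no businesses found nearby
    return [round(c / total, 6) for c in counts]

# Row 2 of the preview: counts for tiers 1-3, with an ASSUMED total of 155
# (152 listed businesses plus 3 presumed fourth-tier businesses).
print(tier_proportions([38, 110, 4], 155))
```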


In [4]:
X = df.drop(columns= ["MEDIAN_GROSS_RENT"])

y = df["MEDIAN_GROSS_RENT"]


In [5]:
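def ultrafit(X_train, X_test, y_train, y_test, model, grid = False, params = None): 
    """Fit a model (optionally via grid search) and report train/test RMSE and R2.

    If grid is True, params is passed to GridSearchCV as param_grid and the
    best estimator found by 5-fold cross-validation is used.
    Returns (fitted model, RMSE_train, RMSE_test, r2_train, r2_test).
    """
    
    if grid: 
        print("Gridsearching...")
        
        # With no scoring argument, GridSearchCV ranks candidates by the
        # estimator's own score method (R2 for regressors).
        griddle = GridSearchCV(model, 
                               param_grid = params, 
                               cv = 5)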
        griddle.fit(X_train, y_train)
        print("Model has been fit.")
        
        mod = griddle.best_estimator_ 
        
        print(f"Best Parameters: \n{griddle.best_params_}")
        
    else: 
        mod = model.fit(X_train, y_train)
        print("Model has been fit.")

    y_train_preds = mod.predict(X_train)
    y_test_preds = mod.predict(X_test)

    RMSE_train = round(mean_squared_error(y_train, y_train_preds)**0.5, 2)
    RMSE_test = round(mean_squared_error(y_test, y_test_preds)**0.5, 2)

    r2_train = r2_score(y_train, y_train_preds)
    r2_test = r2_score(y_test, y_test_preds)
    
    print("----")
    print("Metrics:")
    print(f"Train RMSE = {RMSE_train} \nTest RMSE = {RMSE_test}")
    print("----")
    print(f"Train R2 score = {r2_train} \nTest R2 score = {r2_test}")
    
    return mod, RMSE_train, RMSE_test, r2_train, r2_test
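
The `cv = 5` argument asks `GridSearchCV` to score every parameter combination with 5-fold cross-validation on the training set. As a rough illustration of what that partitioning looks like, here is a plain-Python sketch that splits row indices into consecutive folds. `k_fold_indices` is a hypothetical helper, not the scikit-learn implementation, and it ignores details such as shuffling.

```python
def k_fold_indices(n_samples, n_splits=5):
    """Yield (train_indices, validation_indices) for consecutive, unshuffled folds."""
    # Spread the remainder so that fold sizes differ by at most one.
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size

# With 12 rows and 5 folds, each row lands in exactly one validation fold.
for train, validation in k_fold_indices(12, n_splits=5):
    print(len(train), validation)
```

Each candidate is fit on the training part of every fold and scored on the held-out part. With the default `refit` setting, the combination with the best average score is then refit on the full training set, and that refit model is what `best_estimator_` returns.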
    

#### Train/Test Split 

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42)

In [7]:
rmse_train = []
rmse_test = []
r2_train = []
r2_test = []

#### Model: Linear Regression

In [8]:
lr_one, rmse_tr, rmse_te, r2_tr, r2_te = ultrafit(X_train, X_test, y_train, 
                                                  y_test, LinearRegression())

rmse_train.append(rmse_tr)
rmse_test.append(rmse_te)
r2_train.append(r2_tr)
r2_test.append(r2_te)

Model has been fit.
----
Metrics:
Train RMSE = 199.79 
Test RMSE = 188.17
----
Train R2 score = 0.2698889480964076 
Test R2 score = 0.22712789599113725


#### Model: LASSO 

In [9]:
lasso_one, rmse_tr, rmse_te, r2_tr, r2_te = ultrafit(X_train, X_test, y_train, y_test, 
                                                     LassoCV(cv = 5))

rmse_train.append(rmse_tr)
rmse_test.append(rmse_te)
r2_train.append(r2_tr)
r2_test.append(r2_te)

Model has been fit.
----
Metrics:
Train RMSE = 207.5 
Test RMSE = 187.14
----
Train R2 score = 0.21245330943633733 
Test R2 score = 0.23557043828247826


#### Model: Ridge

In [10]:
ridge_one, rmse_tr, rmse_te, r2_tr, r2_te = ultrafit(X_train, X_test, y_train, y_test, 
                                                     RidgeCV(cv = 5))

rmse_train.append(rmse_tr)
rmse_test.append(rmse_te)
r2_train.append(r2_tr)
r2_test.append(r2_te)

Model has been fit.
----
Metrics:
Train RMSE = 201.53 
Test RMSE = 185.01
----
Train R2 score = 0.25712205724575177 
Test R2 score = 0.25283268575402273




#### Model: Decision Tree Regressor

In [11]:
dtree_params = {
    "max_depth": [3, 5, 10, 12, None]
}

dtree_one, rmse_tr, rmse_te, r2_tr, r2_te = ultrafit(X_train, X_test, y_train, y_test, 
                                                     DecisionTreeRegressor(), grid = True, params = dtree_params)

rmse_train.append(rmse_tr)
rmse_test.append(rmse_te)
r2_train.append(r2_tr)
r2_test.append(r2_te)

Gridsearching...
Model has been fit.
Best Parameters: 
{'max_depth': 3}
----
Metrics:
Train RMSE = 176.61 
Test RMSE = 188.09
----
Train R2 score = 0.42952425635082503 
Test R2 score = 0.22772652334849364




#### Model: K Nearest-Neighbor Regressor

In [12]:
knn_params = {
    "n_neighbors": [3, 5, 7, 10, 12], 
    "weights": ["uniform", "distance"]
}

knn_one, rmse_tr, rmse_te, r2_tr, r2_te = ultrafit(X_train, X_test, y_train, y_test, 
                                                   KNeighborsRegressor(), grid = True, params = knn_params)

rmse_train.append(rmse_tr)
rmse_test.append(rmse_te)
r2_train.append(r2_tr)
r2_test.append(r2_te)

Gridsearching...
Model has been fit.
Best Parameters: 
{'n_neighbors': 7, 'weights': 'uniform'}
----
Metrics:
Train RMSE = 181.54 
Test RMSE = 198.79
----
Train R2 score = 0.39717670478394707 
Test R2 score = 0.13742712324205608
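
One factor that may be worth checking here: K nearest neighbors relies on distances between rows, and the feature matrix mixes raw counts (`1`, `2`, `3`) with proportions between 0 and 1 (`1p`, `2p`, `3p`). `StandardScaler` is imported above but is not applied in the cells shown. The sketch below uses the first two rows of the `df.head()` preview to suggest how much the count columns could dominate an unscaled Euclidean distance. Whether scaling would actually improve the test scores is not something the source establishes.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Rows 0 and 1 from the df.head() preview, split into counts and proportions.
counts_0, props_0 = [30, 24, 0], [0.555556, 0.444444, 0.0]
counts_1, props_1 = [18, 18, 0], [0.5, 0.5, 0.0]

count_part = euclidean(counts_0, counts_1)  # contribution of the raw counts
prop_part = euclidean(props_0, props_1)     # contribution of the proportions

print(round(count_part, 2), round(prop_part, 4))
```

For these two rows the count columns contribute a distance more than a hundred times larger than the proportion columns do, so the neighbors chosen by an unscaled model would be driven almost entirely by the counts.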




#### Model: AdaBoost Regressor

In [13]:
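# Note: newer scikit-learn releases rename this parameter to "estimator"
# ("base_estimator" was deprecated in 1.2 and removed in 1.4).
ada_params = {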
    "base_estimator": [DecisionTreeRegressor(max_depth = 3), DecisionTreeRegressor(max_depth = 5), 
                       DecisionTreeRegressor(max_depth = 7), DecisionTreeRegressor(max_depth = 10), 
                       DecisionTreeRegressor(max_depth = None)], 
    "n_estimators": [20, 30, 50, 60, 70]
}

ada_one, rmse_tr, rmse_te, r2_tr, r2_te = ultrafit(X_train, X_test, y_train, y_test, 
                                                   AdaBoostRegressor(), grid = True, params = ada_params)

rmse_train.append(rmse_tr)
rmse_test.append(rmse_te)
r2_train.append(r2_tr)
r2_test.append(r2_te)

Gridsearching...
Model has been fit.
Best Parameters: 
{'base_estimator': DecisionTreeRegressor(criterion='mse', max_depth=5, max_features=None,
                      max_leaf_nodes=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      presort=False, random_state=None, splitter='best'), 'n_estimators': 60}
----
Metrics:
Train RMSE = 134.47 
Test RMSE = 198.28
----
Train R2 score = 0.6692826686328045 
Test R2 score = 0.14177000287411745




#### Model: Bagged Decision Tree Regressor

In [14]:
bag_params = {
    "n_estimators": [5, 10, 15, 20, 25], 
    "max_features": [0.3, 0.5, 0.7, 1.0], 
    "bootstrap_features": [True, False]
}

bag_one, rmse_tr, rmse_te, r2_tr, r2_te = ultrafit(X_train, X_test, y_train, y_test, 
                                                   BaggingRegressor(), grid = True, params = bag_params)

rmse_train.append(rmse_tr)
rmse_test.append(rmse_te)
r2_train.append(r2_tr)
r2_test.append(r2_te)

Gridsearching...
Model has been fit.
Best Parameters: 
{'bootstrap_features': True, 'max_features': 0.7, 'n_estimators': 15}
----
Metrics:
Train RMSE = 88.3 
Test RMSE = 200.11
----
Train R2 score = 0.8573890600203509 
Test R2 score = 0.12586040555634692




#### Observations

In [16]:
metrics = {
    "Model": ["lr_one", "lasso_one", "ridge_one", "dtree_one", "knn_one", "ada_one", "bag_one"], 
    "RMSE Train": rmse_train, 
    "RMSE Test": rmse_test, 
    "R2 Train": r2_train, 
    "R2 Test": r2_test
}

metrics_df = pd.DataFrame(metrics)

metrics_df["RMSE Difference"] = metrics_df["RMSE Train"] - metrics_df["RMSE Test"]

metrics_df.sort_values(by = ["RMSE Train", "RMSE Difference", "R2 Train"], ascending = True)

| | Model | RMSE Train | RMSE Test | R2 Train | R2 Test | RMSE Difference |
|---|---|---|---|---|---|---|
| 6 | bag_one | 88.3 | 200.11 | 0.857389 | 0.12586 | -111.81 |
| 5 | ada_one | 134.47 | 198.28 | 0.669283 | 0.14177 | -63.81 |
| 3 | dtree_one | 176.61 | 188.09 | 0.429524 | 0.227727 | -11.48 |
| 4 | knn_one | 181.54 | 198.79 | 0.397177 | 0.137427 | -17.25 |
| 0 | lr_one | 199.79 | 188.17 | 0.269889 | 0.227128 | 11.62 |
| 2 | ridge_one | 201.53 | 185.01 | 0.257122 | 0.252833 | 16.52 |
| 1 | lasso_one | 207.5 | 187.14 | 0.212453 | 0.23557 | 20.36 |


In this case, the better-performing models are arguably the **multiple linear regression models (Linear Regression, LASSO, and Ridge)**. Although they had the highest training RMSE of the models tested, the gaps between their training and test RMSE and $R^2$ scores were much smaller than for the other models, and Ridge had the best test scores overall (test RMSE of 185.01, test $R^2$ of 0.253). The other models had a significant problem with overfitting: despite good performance on the training data, their performance on the test data was at best on par with the linear models (the decision tree) and otherwise worse (K nearest neighbors, AdaBoost, and bagging). Even so, the best model explains only about a quarter of the variance in test-set rent.
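
The comparison above can be reproduced from the reported numbers alone. This sketch copies the train and test RMSE values from the metrics table into a plain dictionary, so it does not rerun any model, and ranks the models by test RMSE and by the size of the train/test gap. The gap here is computed as test minus train, which is the opposite sign of the notebook's `RMSE Difference` column.

```python
# (train RMSE, test RMSE) copied from the metrics table above.
reported_rmse = {
    "lr_one": (199.79, 188.17),
    "lasso_one": (207.5, 187.14),
    "ridge_one": (201.53, 185.01),
    "dtree_one": (176.61, 188.09),
    "knn_one": (181.54, 198.79),
    "ada_one": (134.47, 198.28),
    "bag_one": (88.3, 200.11),
}

# Lower test RMSE means better generalization to unseen areas.
by_test_rmse = sorted(reported_rmse, key=lambda name: reported_rmse[name][1])

# A test RMSE far above the train RMSE is a sign of overfitting.
gap = {name: round(test - train, 2) for name, (train, test) in reported_rmse.items()}
by_gap = sorted(gap, key=gap.get, reverse=True)

print(by_test_rmse[:3])  # the three lowest test RMSE values
print(by_gap[:3])        # the three largest test-minus-train gaps
```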