# Regularized Regression Project

Build Ridge, Lasso, and ElasticNet models that predict the `price` column in the San Francisco apartment rentals dataset. Make sure to go through all the relevant steps of the modelling workflow.

1. Use the model you built for the prior project as the basis for comparison. Does regularization improve fit?
2. Feel free to skip the EDA and the assumption checks this time.
3. Engineer (or un-engineer previously engineered) features as needed.
4. Fit Lasso, Ridge, and Elastic Net regressions using the features in your original model.
5. Once you are ready, fit your final model and estimate its performance by scoring on the test data. Report both test R-squared and MAE.
6. What happens to your error if you only model apartments priced at 6000 or less? Should we do this?
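
One way to approach the price-cap question is to apply the cap and see how much data it discards before comparing model error with and without it. The sketch below shows only the filtering step on a small made-up frame; `demo` and its values are illustrative and are not rows from `sf_clean.csv`.

```python
import pandas as pd

# Made-up listings for illustration only; not taken from sf_clean.csv.
demo = pd.DataFrame({
    "price": [2500, 3100, 5900, 6000, 6800, 9000],
    "sqft": [500, 561, 1200, 1300, 1600, 3500],
})

# Keep only rentals at or below the cap, mirroring .query("price <= 6000").
capped = demo.query("price <= 6000")

# Share of listings that the cap removes from the modelling data.
share_dropped = 1 - len(capped) / len(demo)
print(len(capped), round(share_dropped, 2))
```

Dollar MAE will usually fall under such a cap because the most expensive listings, which tend to carry the largest absolute errors, are gone. The trade-off is that the resulting model says nothing reliable about listings above the cap.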

Advice:

1. Remember, regularization doesn't always help, but it can, especially if you let it choose features for you!
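
For reference, the three models differ only in the penalty applied to the coefficient vector. As documented for scikit-learn's `Ridge`, `Lasso`, and `ElasticNet` estimators, with $n$ samples, coefficients $w$, penalty strength $\alpha$, and mixing weight $\rho$ (the `l1_ratio` argument), the objectives are:

$$
\begin{aligned}
\text{Ridge:} \quad & \min_w \; \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2 \\
\text{Lasso:} \quad & \min_w \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1 \\
\text{Elastic Net:} \quad & \min_w \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \rho \lVert w \rVert_1 + \frac{\alpha (1 - \rho)}{2} \lVert w \rVert_2^2
\end{aligned}
$$

Only the L1 term can set coefficients exactly to zero, which is what lets Lasso and Elastic Net drop features. Because the Ridge objective omits the $1/(2n)$ factor, its $\alpha$ values are not directly comparable with those of the other two models.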

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import statsmodels.api as sm
from sklearn.model_selection import KFold
from sklearn.metrics import r2_score as r2, mean_absolute_error as mae, mean_squared_error as mse

rentals_df = pd.read_csv("./Data/sf_clean.csv") #.query("price <= 6000")

rentals_df.head()

| | price | sqft | beds | bath | laundry | pets | housing_type | parking | hood_district |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6800 | 1600.0 | 2.0 | 2.0 | (a) in-unit | (d) no pets | (c) multi | (b) protected | 7.0 |
| 1 | 3500 | 550.0 | 1.0 | 1.0 | (a) in-unit | (a) both | (c) multi | (b) protected | 7.0 |
| 2 | 5100 | 1300.0 | 2.0 | 1.0 | (a) in-unit | (a) both | (c) multi | (d) no parking | 7.0 |
| 3 | 9000 | 3500.0 | 3.0 | 2.5 | (a) in-unit | (d) no pets | (c) multi | (b) protected | 7.0 |
| 4 | 3100 | 561.0 | 1.0 | 1.0 | (c) no laundry | (a) both | (c) multi | (d) no parking | 7.0 |


### Data Dictionary

1. price: The price of the rental and our target variable
2. sqft: The area in square feet of the rental
3. beds: The number of bedrooms in the rental
4. bath: The number of bathrooms in the rental
5. laundry: Does the rental have a laundry machine inside the house, a shared laundry machine, or no laundry on site?
6. pets: Does the rental allow pets? Cats only, dogs only or both cats and dogs?
7. housing_type: Is the rental in a multi-unit building, a building with two units, or a stand-alone house?
8. parking: Does the apartment offer a parking space? No, protected in a garage, off-street in a parking lot, or valet service?
9. hood_district: In which part of San Francisco is the apartment located?

![image info](SFAR_map.png)

## Feature Engineering

In [4]:
laundry_map = {
    "(a) in-unit": "in_unit",
    "(b) on-site": "not_in_unit",
    "(c) no laundry": "not_in_unit",
}

pet_map = {
    "(a) both": "allows_dogs",
    "(b) dogs": "allows_dogs",
    "(c) cats": "no_dogs",
    "(d) no pets": "no_dogs"
}


housing_type_map = {
    "(a) single": "single",
    "(b) double": "multi",
    "(c) multi": "multi",
}

district_map = {
    1.0: "west",
    2.0: "southwest",
    3.0: "southwest",
    4.0: "central",
    5.0: "central",
    6.0: "central",
    7.0: "marina",
    8.0: "north beach",
    9.0: "FiDi/SOMA",
    10.0: "southwest"
}

In [5]:
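# The category-collapsing maps above are left disabled, so the raw category levels are kept.
# Note: with the district map disabled, hood_district stays a numeric code and is not dummy-encoded.
eng_df = rentals_df.assign(
#     hood_district = rentals_df["hood_district"].map(district_map),
#     housing_type = rentals_df["housing_type"].map(housing_type_map),
#     pets = rentals_df["pets"].map(pet_map),
#     laundry = rentals_df["laundry"].map(laundry_map),
    # Polynomial terms let the linear models capture curvature in the size effects
    sqft2 = rentals_df["sqft"] ** 2,
    sqft3 = rentals_df["sqft"] ** 3,
    beds2 = rentals_df["beds"] ** 2,
    beds3 = rentals_df["beds"] ** 3,
    bath2 = rentals_df["bath"] ** 2,
    bath3 = rentals_df["bath"] ** 3,
    # Assumes bath is never 0; a zero would produce inf here
    beds_bath_ratio = rentals_df["beds"] / rentals_df["bath"]
)

# One-hot encode the categorical columns, dropping the first level of each as the reference
eng_df = pd.get_dummies(eng_df, drop_first=True)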

In [6]:
eng_df.head()

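| | price | sqft | beds | bath | hood_district | sqft2 | sqft3 | beds2 | beds3 | bath2 | ... | laundry_(b) on-site | laundry_(c) no laundry | pets_(b) dogs | pets_(c) cats | pets_(d) no pets | housing_type_(b) double | housing_type_(c) multi | parking_(b) protected | parking_(c) off-street | parking_(d) no parking |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6800 | 1600.0 | 2.0 | 2.0 | 7.0 | 2560000.0 | 4096000000.0 | 4.0 | 8.0 | 4.0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 |
| 1 | 3500 | 550.0 | 1.0 | 1.0 | 7.0 | 302500.0 | 166375000.0 | 1.0 | 1.0 | 1.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
| 2 | 5100 | 1300.0 | 2.0 | 1.0 | 7.0 | 1690000.0 | 2197000000.0 | 4.0 | 8.0 | 1.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 3 | 9000 | 3500.0 | 3.0 | 2.5 | 7.0 | 12250000.0 | 42875000000.0 | 9.0 | 27.0 | 6.25 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 |
| 4 | 3100 | 561.0 | 1.0 | 1.0 | 7.0 | 314721.0 | 176558500.0 | 1.0 | 1.0 | 1.0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |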


## Data Splitting

In [9]:
from sklearn.model_selection import train_test_split

target = "price"
drop_cols = [
#     "pets_no_dogs",
#     "housing_type_single"
]

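# Note: StandardScaler (next section) centers this constant column to all zeros, so its
# coefficient is 0.0 in every model below; the scikit-learn estimators fit their own intercept.
X = sm.add_constant(eng_df.drop([target] + drop_cols, axis=1))

# Log transform slightly improves normality
y = np.log(eng_df[target])
# y = eng_df[target]

# Test Split: from here on, X and y refer to the training portion only
X, X_test, y, y_test = train_test_split(X, y, test_size=.2, random_state=2023)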

## Scaling Data

In [10]:
from sklearn.preprocessing import StandardScaler

std = StandardScaler()
X_tr = std.fit_transform(X.values)
X_te = std.transform(X_test.values)
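
Fitting the scaler once on all training rows, as above, means that during cross-validation each validation fold has already contributed to the scaling statistics. The effect is typically small, but wrapping the scaler and the model in a pipeline avoids it. A minimal sketch on synthetic data follows; `X_demo`, `y_demo`, and the `alpha` value are placeholders, not the rental data or a tuned setting.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training features and log-price target (illustrative only).
rng = np.random.default_rng(2023)
X_demo = rng.normal(size=(200, 5))
y_demo = X_demo @ np.array([0.5, 0.0, 0.2, 0.0, -0.1]) + rng.normal(scale=0.1, size=200)

# Inside cross_val_score the whole pipeline is refit per fold, so the scaler
# only ever sees the training part of each split.
pipe = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
cv = KFold(n_splits=5, shuffle=True, random_state=2023)
cv_r2 = cross_val_score(pipe, X_demo, y_demo, cv=cv, scoring="r2")

print(cv_r2.mean())
```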

## Ridge

In [11]:
from sklearn.linear_model import RidgeCV

n_alphas = 100
alphas = 10 ** np.linspace(-3, 3, n_alphas)

ridge_model = RidgeCV(alphas=alphas, cv=5)

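ridge_model.fit(X_tr, y)
# Note: score() and predict() are evaluated on the full training set here, so these
# figures measure in-sample fit rather than held-out fold performance.
print(f"Cross Val R2: {ridge_model.score(X_tr, y)}")
print(f"Cross Val MAE: {mae(np.exp(y), np.exp(ridge_model.predict(X_tr)))}")
print(f"Alpha: {ridge_model.alpha_}")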

Cross Val R2: 0.8074196428452629
Cross Val MAE: 486.99997261615124
Alpha: 0.1519911082952933
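
The R2 and MAE printed above come from predicting the same rows the model was fit on. One way to obtain out-of-fold estimates instead is `cross_val_predict`. The sketch below assumes synthetic data and a fixed `alpha`; `X_demo` and `log_price` are stand-in names, not the rental data.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import cross_val_predict

# Synthetic features and a log-scale target (illustrative only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
log_price = 8.0 + X_demo @ np.array([0.3, 0.1, 0.0, -0.05]) + rng.normal(scale=0.1, size=300)

# Each row is predicted by a model that never saw it during fitting.
model = Ridge(alpha=0.15)
oof_pred = cross_val_predict(model, X_demo, log_price, cv=5)

# R2 on the log scale; MAE after converting back to the original units.
cv_r2 = r2_score(log_price, oof_pred)
cv_mae = mean_absolute_error(np.exp(log_price), np.exp(oof_pred))
print(cv_r2, cv_mae)
```

Because `alpha` is fixed here, this does not account for the optimism introduced by tuning it on the same folds; nesting the search inside the cross-validation would be the stricter option.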


In [12]:
list(zip(X.columns, ridge_model.coef_))

[('const', 0.0),
 ('sqft', 0.5053754824577108),
 ('beds', 0.06239517124426997),
 ('bath', 0.13703589203885377),
 ('hood_district', -0.006670843978687262),
 ('sqft2', -0.44805497363278146),
 ('sqft3', 0.17215428128398313),
 ('beds2', -0.17443411652574),
 ('beds3', 0.11220830969721955),
 ('bath2', -0.14200368344490824),
 ('bath3', 0.09260847075261638),
 ('beds_bath_ratio', 0.06972641132036851),
 ('laundry_(b) on-site', -0.03830402957354986),
 ('laundry_(c) no laundry', -0.03437272044157739),
 ('pets_(b) dogs', 0.007612351646764956),
 ('pets_(c) cats', -0.0033671477792682554),
 ('pets_(d) no pets', -0.007584338831835654),
 ('housing_type_(b) double', -0.010907884951135902),
 ('housing_type_(c) multi', 0.01726385381226986),
 ('parking_(b) protected', -0.08254452742391606),
 ('parking_(c) off-street', -0.027381739479437058),
 ('parking_(d) no parking', -0.1093503776918138)]

## Lasso

In [13]:
from sklearn.linear_model import LassoCV

n_alphas = 200
alphas = 10 ** np.linspace(-2, 3, n_alphas)

lasso_model = LassoCV(alphas=alphas, cv=5)

lasso_model.fit(X_tr, y)

# Note: these scores are computed on the training rows, so they are in-sample figures
print(f"Cross Val R2: {lasso_model.score(X_tr, y)}")
print(f"Cross Val MAE: {mae(np.exp(y), np.exp(lasso_model.predict(X_tr)))}")
print(f"Alpha: {lasso_model.alpha_}")

Cross Val R2: 0.7823247446809933
Cross Val MAE: 520.2849889610231
Alpha: 0.01


In [14]:
list(zip(X.columns, lasso_model.coef_))

[('const', 0.0),
 ('sqft', 0.21437017549295648),
 ('beds', 0.0),
 ('bath', 0.07681231295459999),
 ('hood_district', 0.0),
 ('sqft2', -0.0),
 ('sqft3', -0.0068392197744254),
 ('beds2', -0.0),
 ('beds3', -0.0),
 ('bath2', 0.0),
 ('bath3', 0.0),
 ('beds_bath_ratio', 0.07760738987505986),
 ('laundry_(b) on-site', -0.0383880319635714),
 ('laundry_(c) no laundry', -0.03055694147043361),
 ('pets_(b) dogs', 0.0019526372729564022),
 ('pets_(c) cats', -0.0),
 ('pets_(d) no pets', -8.830885520107288e-05),
 ('housing_type_(b) double', -0.004230549530136255),
 ('housing_type_(c) multi', 0.018677682700661485),
 ('parking_(b) protected', -0.0),
 ('parking_(c) off-street', 0.0),
 ('parking_(d) no parking', -0.03548950276343871)]
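
Coefficients that Lasso shrinks exactly to zero correspond to features it has dropped. A small sketch of reading that selection off a fitted model follows, using synthetic data and made-up feature names in which only the first and third features carry signal; the `alpha` value is arbitrary.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Hypothetical feature names and synthetic data (illustrative only).
feature_names = np.array(["sqft", "beds", "bath", "noise_a", "noise_b"])
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(400, 5))
y_demo = 0.5 * X_demo[:, 0] + 0.2 * X_demo[:, 2] + rng.normal(scale=0.05, size=400)

lasso = Lasso(alpha=0.05).fit(X_demo, y_demo)

# Boolean masks on coef_ split the features into kept and dropped sets.
kept = feature_names[lasso.coef_ != 0]
dropped = feature_names[lasso.coef_ == 0]
print("kept:", list(kept))
print("dropped:", list(dropped))
```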

## Elastic Net

In [15]:
from sklearn.linear_model import ElasticNetCV

alphas = 10 ** np.linspace(-2, 3, 200)
l1_ratios = np.linspace(.01, 1, 100)

enet_model = ElasticNetCV(alphas=alphas, l1_ratio=l1_ratios, cv=5)

enet_model.fit(X_tr, y)

# Note: these scores are computed on the training rows, so they are in-sample figures
print(f"Cross Val R2: {enet_model.score(X_tr, y)}")
print(f"Cross Val MAE: {mae(np.exp(y), np.exp(enet_model.predict(X_tr)))}")
print(f"Alpha: {enet_model.alpha_}")
print(f"L1_Ratio: {enet_model.l1_ratio_}")

Cross Val R2: 0.8011930622763215
Cross Val MAE: 493.70745181736663
Alpha: 0.01
L1_Ratio: 0.01
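
When a cross-validated search returns a value at the very edge of its grid, the best setting may lie outside the range that was searched. A quick check for this, with `on_grid_edge` as a hypothetical helper name and the grids copied from the cell above:

```python
import numpy as np

def on_grid_edge(chosen, grid):
    """Return True when the chosen value equals the smallest or largest grid value."""
    grid = np.asarray(grid, dtype=float)
    return bool(np.isclose(chosen, grid.min()) or np.isclose(chosen, grid.max()))

# Same grids as in the Elastic Net cell.
alphas = 10 ** np.linspace(-2, 3, 200)
l1_ratios = np.linspace(0.01, 1, 100)

print(on_grid_edge(0.01, alphas))     # alpha reported above
print(on_grid_edge(0.01, l1_ratios))  # l1_ratio reported above
```

Both checks come back true for the values reported above, and the Lasso alpha of 0.01 sits on the same lower edge. Extending the alpha grid below 0.01 (for example `np.logspace(-4, 3, 200)`) would be one thing to try.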


## Final Model Test

In [16]:
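# Ridge is used as the final model; it had the highest R2 and lowest MAE of the three above.
# MAE is reported in dollars (predictions are exponentiated); R2 is computed on log price.
print(f"Test MAE: {mae(np.exp(y_test), np.exp(ridge_model.predict(X_te)))}")
print(f"Test R2: {r2(y_test, ridge_model.predict(X_te))}")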

Test MAE: 445.49613251160423
Test R2: 0.7813116247107397
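
The first question in the brief asks whether regularization improves on the unregularized baseline, which needs both models scored on the same split with the same metrics. A minimal sketch of such a like-for-like comparison follows. It uses synthetic data, and `dollar_metrics`, `X_demo`, and `y_log` are hypothetical names standing in for the prior project's model and the rental features.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

def dollar_metrics(model, X, y_log):
    """Score a fitted log-target model: R2 on the log scale, MAE in original units."""
    pred_log = model.predict(X)
    return {
        "r2_log": r2_score(y_log, pred_log),
        "mae_dollars": mean_absolute_error(np.exp(y_log), np.exp(pred_log)),
    }

# Synthetic features and log-price target (illustrative only).
rng = np.random.default_rng(2023)
X_demo = rng.normal(size=(500, 6))
true_coefs = np.array([0.4, 0.1, 0.0, 0.0, -0.1, 0.05])
y_log = 8.0 + X_demo @ true_coefs + rng.normal(scale=0.15, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_log, test_size=0.2, random_state=2023
)

# Fit each candidate on the same training rows and score on the same test rows.
candidates = {"ols": LinearRegression(), "ridge": Ridge(alpha=0.15)}
results = {
    name: dollar_metrics(est.fit(X_train, y_train), X_test, y_test)
    for name, est in candidates.items()
}
for name, metrics in results.items():
    print(name, metrics)
```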
