<a href="https://colab.research.google.com/github/rgolds5/DS-Unit-2-Linear-Models/blob/master/Module_2_Doing_Linear_Regression_Assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
!pip install category_encoders

Collecting category_encoders
  Downloading https://files.pythonhosted.org/packages/6e/a1/f7a22f144f33be78afeb06bfa78478e8284a64263a3c09b1ef54e673841e/category_encoders-2.0.0-py2.py3-none-any.whl (87kB)
Installing collected packages: category-encoders
Successfully installed category-encoders-2.0.0


In [0]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import mean_absolute_error
from sklearn.linear_model import LinearRegression
import category_encoders as ce
from sklearn.preprocessing import StandardScaler

model = LinearRegression()
encoder = ce.OneHotEncoder(use_cat_names = True)
scaler = StandardScaler()

data_url = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Linear-Models/master/data/nyc/nyc-rent-2016.csv'

df = pd.read_csv(data_url)
assert df.shape == (48300, 34)



In [70]:
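# Parse the listing timestamps, then store the month as a real column.
# Bracket assignment is needed here: `df.month = ...` sets a plain attribute, not a column.
df['created'] = pd.to_datetime(df['created'], infer_datetime_format = True)
df['month'] = df['created'].dt.month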

train = df[df.month < 6]
test = df[df.month == 6]

train.price.mean()

  


3432.7534190068222
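
The training mean above is presumably a reference point for the models that follow. A minimal sketch of scoring it as a naive baseline, assuming every test listing is predicted at the training mean; the price arrays here are hypothetical stand-ins for `train.price` and `test.price`, not values from the dataset:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical stand-in prices; in the notebook these would be train.price and test.price.
y_train_demo = np.array([2500.0, 3000.0, 3500.0, 4000.0])
y_test_demo = np.array([2800.0, 3600.0])

# Predict the training mean for every test row, then score it like any other model.
baseline_pred = np.full(len(y_test_demo), y_train_demo.mean())
baseline_mae = mean_absolute_error(y_test_demo, baseline_pred)

print(f'Baseline MAE: {baseline_mae:.2f}')
```

Any fitted model should beat this number to be worth keeping.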

In [65]:
features = [
    'bathrooms',
    'bedrooms',
    'longitude',
    ]

target = 'price'

X_train = train[features]
y_train = train[target]

X_test = test[features]
y_test = test[target]

X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

model.fit(X_train, y_train)

y_pred = model.predict(X_test)

mean_absolute_error(y_test, y_pred)

667.293314729857

In [66]:
features = [
    'bathrooms',
    'bedrooms',
    'doorman',
    'longitude',    
    'interest_level'
  ]

target = 'price'

X_train = train[features]
y_train = train[target]

X_test = test[features]
y_test = test[target]

X_train = encoder.fit_transform(X_train)
X_test = encoder.transform(X_test)

X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

model.fit(X_train, y_train)

y_pred = model.predict(X_test)

print(f'MAE: {mean_absolute_error(y_test, y_pred)}')
print(f'Intercept: {model.intercept_}')
model.coef_
              

MAE: 606.4804449587706
Intercept: 3431.9438196502865


array([ 5.46283072e+02,  5.62402433e+02,  2.38591837e+02, -3.72466875e+02,
       -6.65512642e+15, -1.14270815e+16, -1.04346006e+16])

I ran the code above with several feature combinations. The five features I settled on, _bathrooms_, _bedrooms_, _longitude_, _interest_level_, and _doorman_, gave the most _bang for my buck_: the MAE fell from about 667 with the three-feature model to about 606 with these five.
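
The manual trial of feature combinations could be automated roughly as follows. This is a sketch on synthetic data, not the code I actually ran: the `demo` frame, its price formula, and the `score` helper are all made up for illustration.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# Synthetic stand-in for the rent data (assumption: price depends on bathrooms and bedrooms).
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    'bathrooms': rng.integers(1, 4, n),
    'bedrooms': rng.integers(0, 5, n),
    'longitude': rng.normal(-73.97, 0.03, n),
})
demo['price'] = 1500 + 900 * demo['bathrooms'] + 500 * demo['bedrooms'] + rng.normal(0, 200, n)
train_demo, test_demo = demo.iloc[:150], demo.iloc[150:]


def score(features):
    """Fit OLS on the given columns and return the test MAE (hypothetical helper)."""
    model = LinearRegression().fit(train_demo[features], train_demo['price'])
    return mean_absolute_error(test_demo['price'], model.predict(test_demo[features]))


# Score every non-empty subset of the candidate features.
candidates = ['bathrooms', 'bedrooms', 'longitude']
results = {
    combo: score(list(combo))
    for k in range(1, len(candidates) + 1)
    for combo in combinations(candidates, k)
}

best = min(results, key=results.get)
for combo, mae in sorted(results.items(), key=lambda item: item[1]):
    print(f'{mae:8.2f}  {combo}')
```

Exhaustive search like this only scales to a handful of candidates, since the number of subsets doubles with each added feature.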

In [67]:
from sklearn.linear_model import Ridge

reg = Ridge(alpha = 0.5)

features = [
    'bathrooms',
    'bedrooms',
    'doorman',
    'longitude',
    'interest_level'
      ]

target = 'price'

X_train = train[features]
y_train = train[target]

X_test = test[features]
y_test = test[target]


X_train = encoder.fit_transform(X_train)
X_test = encoder.transform(X_test)

X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

reg.fit(X_train, y_train)

y_pred = reg.predict(X_test)

print(f'MAE: {mean_absolute_error(y_test, y_pred)}')
print(f'Intercept: {reg.intercept_}')
reg.coef_

MAE: 606.3352938459542
Intercept: 3432.7534190069337


array([ 543.16250263,  559.54176157,  237.21277821, -372.80510009,
       -110.29890353,  131.10550301,  -73.22753903])

Using ridge regression, I started with every feature and saw only an 11-point improvement in the model's MAE. I then removed features one at a time, each time dropping the feature with the smallest absolute coefficient. The MAE barely changed at each step until only the five features listed above remained; removing any of those made the model noticeably worse. With the same features as OLS, ridge regression showed no meaningful improvement (MAE 606.34 versus 606.48).
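
The elimination procedure described above can be sketched as a loop. This runs on synthetic data with made-up column names (`noise_a`, `noise_b`), and it assumes the features are standardized so that coefficient magnitudes are comparable; it is an illustration of the idea, not the exact code I ran.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.preprocessing import StandardScaler

# Synthetic data: two informative columns and two pure-noise columns (all hypothetical).
rng = np.random.default_rng(42)
n = 300
X_demo = pd.DataFrame({
    'bathrooms': rng.integers(1, 4, n).astype(float),
    'bedrooms': rng.integers(0, 5, n).astype(float),
    'noise_a': rng.normal(size=n),
    'noise_b': rng.normal(size=n),
})
y_demo = 1500 + 900 * X_demo['bathrooms'] + 500 * X_demo['bedrooms'] + rng.normal(0, 200, n)
X_tr, X_te = X_demo.iloc[:200], X_demo.iloc[200:]
y_tr, y_te = y_demo.iloc[:200], y_demo.iloc[200:]

remaining = list(X_demo.columns)
history = []  # (features used, test MAE, feature dropped next)

while len(remaining) > 1:
    # Refit the scaler and the model on the current feature set.
    scaler = StandardScaler()
    reg = Ridge(alpha=0.5).fit(scaler.fit_transform(X_tr[remaining]), y_tr)
    mae = mean_absolute_error(y_te, reg.predict(scaler.transform(X_te[remaining])))

    # Drop the feature whose coefficient has the smallest absolute value.
    weakest = remaining[int(np.argmin(np.abs(reg.coef_)))]
    history.append((tuple(remaining), mae, weakest))
    remaining.remove(weakest)

for used, mae, dropped in history:
    print(f'{len(used)} features, MAE {mae:.2f}, dropping {dropped}')
```

Watching where the MAE starts to rise in `history` indicates which features the model actually depends on.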

In [0]:
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
import category_encoders as ce
from sklearn.preprocessing import StandardScaler

encoder = ce.OneHotEncoder(use_cat_names = True)
scaler = StandardScaler()

features = [
    'balcony',
    'bathrooms',
    'bedrooms',
    'cats_allowed',
    'common_outdoor_space',
    'dining_room',
    'dishwasher',
    'doorman',
    'exclusive',
    'elevator',
    'fitness_center',
    'garden_patio',
    'hardwood_floors',
    'high_speed_internet',
    'latitude',
    'laundry_in_building',
    'laundry_in_unit',
    'loft',
    'longitude',
    'new_construction',
    'no_fee',
    'outdoor_space',
    'roof_deck',
    'swimming_pool',
    'terrace',
    'wheelchair_access',
    'interest_level',
       ]

target = 'price'

X_train = train[features]
y_train = train[target]

X_test = test[features]
y_test = test[target]

X_train = encoder.fit_transform(X_train)
X_test = encoder.transform(X_test)

X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

estimator = LinearRegression()
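# Pass n_features_to_select by keyword; recent scikit-learn versions reject it positionally.
selector = RFE(estimator, n_features_to_select = 7, step = 1)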
selector = selector.fit(X_train, y_train)


In [91]:
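# Caveat: `features` holds the 27 pre-encoding names, but `selector` was fit on the encoded
# matrix, where `interest_level` expands to one column per category. `zip` stops at the shorter
# input, which is presumably why only 5 of the 7 selected columns appear with rank 1 below.
df_2 = (pd.DataFrame(list(zip(features, selector.support_, selector.ranking_)), 
                     columns = ['Feature', 'Selected Feature', 'Feature Ranking'])
        .sort_values(['Feature Ranking', 'Selected Feature']))
df_2.head(8)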

    Feature              Selected Feature  Feature Ranking
1   bathrooms            True              1
2   bedrooms             True              1
7   doorman              True              1
18  longitude            True              1
26  interest_level       True              1
16  laundry_in_unit      False             2
13  high_speed_internet  False             3
9   elevator             False             4
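
Because `interest_level` is one-hot encoded, the matrix passed to RFE has more columns than the `features` list. A sketch of keeping the ranking aligned by pairing it with the encoded column names instead. It runs on synthetic data and uses `pd.get_dummies` as a stand-in for `ce.OneHotEncoder`, so the names and numbers are illustrative only:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the rent data (hypothetical values).
rng = np.random.default_rng(1)
n = 300
raw = pd.DataFrame({
    'bathrooms': rng.integers(1, 4, n),
    'bedrooms': rng.integers(0, 5, n),
    'doorman': rng.integers(0, 2, n),
    'interest_level': rng.choice(['low', 'medium', 'high'], n),
})
price = 1500 + 900 * raw['bathrooms'] + 500 * raw['bedrooms'] + rng.normal(0, 200, n)

# One-hot encode, and capture the column names BEFORE scaling turns the frame into an array.
encoded = pd.get_dummies(raw, columns=['interest_level'], dtype=float)
encoded_names = list(encoded.columns)
X_scaled = StandardScaler().fit_transform(encoded)

selector = RFE(LinearRegression(), n_features_to_select=2, step=1).fit(X_scaled, price)

# One row per encoded column, so names, support flags, and ranks all have the same length.
ranking = pd.DataFrame({
    'Feature': encoded_names,
    'Selected Feature': selector.support_,
    'Feature Ranking': selector.ranking_,
}).sort_values('Feature Ranking')
print(ranking)
```

With this pairing, each category of `interest_level` gets its own row and rank rather than being collapsed into a single entry.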
