Lambda School Data Science

*Unit 2, Sprint 1, Module 2*

---

# Regression 2

## Assignment

You'll continue to **predict how much it costs to rent an apartment in NYC,** using the dataset from renthop.com.

- [ ] Do train/test split. Use data from April & May 2016 to train. Use data from June 2016 to test.
- [ ] Engineer at least two new features. (See below for explanation & ideas.)
- [ ] Fit a linear regression model with at least two features.
- [ ] Get the model's coefficients and intercept.
- [ ] Get regression metrics RMSE, MAE, and $R^2$, for both the train and test data.
- [ ] What's the best test MAE you can get? Share your score and features used with your cohort on Slack!
- [ ] As always, commit your notebook to your fork of the GitHub repo.


#### [Feature Engineering](https://en.wikipedia.org/wiki/Feature_engineering)

> "Some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used." — Pedro Domingos, ["A Few Useful Things to Know about Machine Learning"](https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf)

> "Coming up with features is difficult, time-consuming, requires expert knowledge. 'Applied machine learning' is basically feature engineering." — Andrew Ng, [Machine Learning and AI via Brain simulations](https://forum.stanford.edu/events/2011/2011slides/plenary/2011plenaryNg.pdf) 

> Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. 

#### Feature Ideas
- Does the apartment have a description?
- How long is the description?
- How many total perks does each apartment have?
- Are cats _or_ dogs allowed?
- Are cats _and_ dogs allowed?
- Total number of rooms (beds + baths)
- Ratio of beds to baths
- What's the neighborhood, based on address or latitude & longitude?
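
A few of these ideas can be sketched on a toy DataFrame. This is only an illustration: the column names `cats_allowed` and `dogs_allowed` and the choice of perk columns are assumptions, so check `df.columns` for the names in the real dataset before reusing it.

```python
import pandas as pd

# Toy listings; 'cats_allowed' and 'dogs_allowed' are assumed column names.
toy = pd.DataFrame({
    'description': ['Sunny loft near the park', None, '   '],
    'cats_allowed': [1, 0, 1],
    'dogs_allowed': [1, 0, 0],
    'balcony': [1, 0, 0],
    'loft': [1, 0, 0],
})

# Treat missing and whitespace-only descriptions as empty.
desc = toy['description'].fillna('').str.strip()
toy['has_description'] = (desc != '').astype(int)
toy['description_length'] = desc.str.len()

# Cats OR dogs allowed, and cats AND dogs allowed.
cats = toy['cats_allowed'] == 1
dogs = toy['dogs_allowed'] == 1
toy['cats_or_dogs'] = (cats | dogs).astype(int)
toy['cats_and_dogs'] = (cats & dogs).astype(int)

# Total perks: row-wise sum over the binary perk columns chosen here.
perk_cols = ['cats_allowed', 'dogs_allowed', 'balcony', 'loft']
toy['perk_count'] = toy[perk_cols].sum(axis=1)

print(toy[['has_description', 'description_length',
           'cats_or_dogs', 'cats_and_dogs', 'perk_count']])
```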

## Stretch Goals
- [ ] If you want more math, skim [_An Introduction to Statistical Learning_](http://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf), Chapter 3.1, Simple Linear Regression, & Chapter 3.2, Multiple Linear Regression
- [ ] If you want more introduction, watch [Brandon Foltz, Statistics 101: Simple Linear Regression](https://www.youtube.com/watch?v=ZkjP5RJLQF4) (20 minutes, over 1 million views)
- [ ] Add your own stretch goal(s)!

In [5]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Applied-Modeling/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'
    
# Ignore this Numpy warning when using Plotly Express:
# FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning, module='numpy')

In [6]:
import numpy as np
import pandas as pd

# Read New York City apartment rental listing data
df = pd.read_csv(DATA_PATH+'apartments/renthop-nyc.csv')
assert df.shape == (49352, 34)

# Remove the most extreme 1% prices,
# the most extreme .1% latitudes, &
# the most extreme .1% longitudes
df = df[(df['price'] >= np.percentile(df['price'], 0.5)) & 
        (df['price'] <= np.percentile(df['price'], 99.5)) & 
        (df['latitude'] >= np.percentile(df['latitude'], 0.05)) & 
        (df['latitude'] < np.percentile(df['latitude'], 99.95)) &
        (df['longitude'] >= np.percentile(df['longitude'], 0.05)) & 
        (df['longitude'] <= np.percentile(df['longitude'], 99.95))]
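
The percentile filter above can also be written as a small helper. This is a sketch, not the notebook's own code: `trim_extremes` is a hypothetical name, and `Series.between` is inclusive on both ends, so it differs slightly from the strict `<` used for the upper latitude bound.

```python
import numpy as np
import pandas as pd

def trim_extremes(frame, column, lower_pct, upper_pct):
    """Keep rows whose `column` value lies between the two percentiles (inclusive)."""
    lo, hi = np.percentile(frame[column], [lower_pct, upper_pct])
    return frame[frame[column].between(lo, hi)]

# Toy data: prices 1..1000, trimming the most extreme 1% (0.5% on each side).
toy = pd.DataFrame({'price': list(range(1, 1001))})
trimmed = trim_extremes(toy, 'price', 0.5, 99.5)
print(len(toy), '->', len(trimmed))
```

Chaining one call per column (price, latitude, longitude) would recompute each percentile on already-trimmed data, whereas the cell above computes all percentiles on the full data, so the two approaches can keep slightly different rows.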

In [7]:
df.head()

Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,...,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space
0,1.5,3,2016-06-24 07:54:24,A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ...,Metropolitan Avenue,40.7145,-73.9425,3000,792 Metropolitan Avenue,medium,...,0,0,0,0,0,0,0,0,0,0
1,1.0,2,2016-06-12 12:19:27,,Columbus Avenue,40.7947,-73.9667,5465,808 Columbus Avenue,low,...,0,0,0,0,0,0,0,0,0,0
2,1.0,1,2016-04-17 03:26:41,"Top Top West Village location, beautiful Pre-w...",W 13 Street,40.7388,-74.0018,2850,241 W 13 Street,high,...,0,0,0,0,0,0,0,0,0,0
3,1.0,1,2016-04-18 02:22:02,Building Amenities - Garage - Garden - fitness...,East 49th Street,40.7539,-73.9677,3275,333 East 49th Street,low,...,0,0,0,0,0,0,0,0,0,0
4,1.0,4,2016-04-28 01:32:41,Beautifully renovated 3 bedroom flex 4 bedroom...,West 143rd Street,40.8241,-73.9493,3350,500 West 143rd Street,low,...,0,0,0,0,0,0,0,0,0,0


In [8]:
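# Create two new features: total rooms (bedrooms + bathrooms) and the bedroom-to-bathroom ratio.
# Listings with zero bathrooms produce inf (or NaN for 0/0) in the ratio; these are cleaned up in the following cells.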
df['bedbath'] = df['bedrooms'] + df['bathrooms']
df['bedbath_ratio'] = df['bedrooms']/df['bathrooms']
df.sample(1)

Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,...,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space,bedbath,bedbath_ratio
31384,1.0,3,2016-05-07 06:06:01,This Spacious convertible 3 bedroom apartment...,Rector Street,40.7083,-74.0149,3350,33 Rector Street,low,...,0,0,0,0,0,0,0,0,4.0,3.0


In [9]:
# Replace inf values (from dividing by zero bathrooms) with NaN:
df = df.replace([np.inf, -np.inf], np.nan)

In [10]:
# Drop every row that contains a NaN in any column:
df = df.dropna()
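
To see why the two cleanup steps are needed, here is a toy sketch of how the ratio behaves when a listing has zero bathrooms. The data is made up, and passing `subset=` to `dropna` is a suggested alternative rather than what the notebook does: it drops only rows where the ratio is missing instead of rows with a NaN in any column.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'bedrooms': [2, 0, 1], 'bathrooms': [1.0, 0.0, 0.0]})
toy['ratio'] = toy['bedrooms'] / toy['bathrooms']

# 2/1 is an ordinary number, 0/0 gives NaN, and 1/0 gives inf.
n_inf = int(np.isinf(toy['ratio']).sum())
n_nan = int(toy['ratio'].isna().sum())
print(toy['ratio'].tolist(), n_inf, n_nan)

# Turn inf into NaN, then drop only the rows whose ratio is missing.
cleaned = toy.replace([np.inf, -np.inf], np.nan).dropna(subset=['ratio'])
print(cleaned)
```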

In [11]:
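# Select date ranges for train and test data.
# Caveat: 'created' holds strings that include a time, so `<= '2016-05-31'`
# excludes listings created during May 31 (and `<= '2016-06-30'` excludes June 30).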
train = df[(df['created'] >= '2016-04-01') & (df['created'] <= '2016-05-31')]
test = df[(df['created'] >= '2016-06-01') & (df['created'] <= '2016-06-30')]
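
The effect of comparing timestamp strings against a bare date can be checked on toy data. The sketch below uses made-up timestamps; it parses `created` with `pd.to_datetime` and uses half-open intervals (`>=` start, `<` first day of the next month), which is one way to keep the last day of each month.

```python
import pandas as pd

toy = pd.DataFrame({'created': ['2016-05-30 23:59:59',
                                '2016-05-31 08:15:00',
                                '2016-06-01 00:00:00',
                                '2016-06-30 18:00:00']})

# String comparison: '2016-05-31 08:15:00' sorts after '2016-05-31', so it is dropped.
by_string = toy[(toy['created'] >= '2016-04-01') & (toy['created'] <= '2016-05-31')]

# Datetime comparison with half-open intervals keeps the whole last day.
created = pd.to_datetime(toy['created'])
train_toy = toy[(created >= pd.Timestamp('2016-04-01')) & (created < pd.Timestamp('2016-06-01'))]
test_toy = toy[(created >= pd.Timestamp('2016-06-01')) & (created < pd.Timestamp('2016-07-01'))]

print(len(by_string), len(train_toy), len(test_toy))
```

Switching the notebook to this split would change the train and test row counts, and therefore the metrics reported below.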

In [12]:
# Let's look at the shapes of the new DataFrames:
train.shape, test.shape

((30610, 36), (16341, 36))

In [13]:
# Start with a baseline: the mean price in the training data:
train['price'].mean()

3574.628552760536

In [14]:
# y target vectors:
target = 'price'
y_train = train[target]
y_test = test[target]

In [15]:
# Calculate the mean baseline:
guess = y_train.mean()

In [16]:
# Calculate the mean baseline's MAE on the train data:
from sklearn.metrics import mean_absolute_error
y_pred = [guess] * len(y_train)
mae = mean_absolute_error(y_train, y_pred)

In [17]:
# Calculate the mean baseline's MAE on the test data (still guessing the train mean):
y_pred = [guess] * len(y_test)
mae = mean_absolute_error(y_test, y_pred)
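
The two baseline cells compute the MAE but do not display it. As a minimal sketch of what the mean baseline measures, here is the same calculation on made-up prices using plain NumPy; `baseline_mae` is a hypothetical helper name.

```python
import numpy as np

def baseline_mae(y_true, guess):
    """MAE when every prediction is the same constant guess."""
    y_true = np.asarray(y_true, dtype=float)
    return float(np.mean(np.abs(y_true - guess)))

# Made-up rents, not taken from the dataset.
y_train_toy = [2000, 3000, 4000]
y_test_toy = [2500, 3500]

guess_toy = float(np.mean(y_train_toy))          # the guess always comes from train
train_mae_toy = baseline_mae(y_train_toy, guess_toy)
test_mae_toy = baseline_mae(y_test_toy, guess_toy)
print(f'guess={guess_toy:.0f}, train MAE={train_mae_toy:.2f}, test MAE={test_mae_toy:.2f}')
```

Any fitted model should beat these numbers on its own data; otherwise the features add nothing over guessing the average.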

In [18]:
# Import estimator and instantiate the class:
from sklearn.linear_model import LinearRegression
model = LinearRegression()

In [19]:
# Setting X features matrices:
features = ['bedbath', 'bedbath_ratio']
X_train = train[features]
X_train

Unnamed: 0,bedbath,bedbath_ratio
2,2.0,1.0
3,2.0,1.0
4,5.0,4.0
5,6.0,2.0
6,3.0,2.0
...,...,...
49346,2.0,1.0
49348,2.0,1.0
49349,2.0,1.0
49350,1.0,0.0


In [20]:
X_test = test[features]
X_test

Unnamed: 0,bedbath,bedbath_ratio
0,4.5,2.0
1,3.0,2.0
11,2.0,1.0
14,2.0,1.0
24,6.0,2.0
...,...,...
49305,3.0,2.0
49310,4.0,3.0
49320,2.0,1.0
49332,3.0,2.0


In [21]:
# Check that no inf or NaN values remain (dropna=False keeps NaN in the counts):
X_test.bedbath_ratio.value_counts(dropna=False)

1.000000    6265
2.000000    4239
0.000000    2999
3.000000    1285
1.500000     958
1.333333     127
4.000000     119
0.666667      71
0.500000      68
2.500000      49
1.200000      33
0.800000      24
2.666667      21
1.600000      15
1.666667      13
1.142857      12
0.857143      12
1.250000       7
0.333333       6
0.750000       5
5.000000       3
0.888889       3
1.428571       2
3.333333       2
0.400000       2
0.571429       1
Name: bedbath_ratio, dtype: int64

In [22]:
# Fitting model:
model.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
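
As a sanity check on what `fit` does, the sketch below fits the same estimator to toy data generated from a known equation, y = 1 + 2·x1 + 3·x2, and confirms that the intercept and coefficients are recovered. The data and column names are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Noise-free toy data from y = 1 + 2*x1 + 3*x2.
X_toy = pd.DataFrame({'x1': [0, 1, 0, 1, 2], 'x2': [0, 0, 1, 1, 1]})
y_toy = 1 + 2 * X_toy['x1'] + 3 * X_toy['x2']

toy_model = LinearRegression()
toy_model.fit(X_toy, y_toy)

print(toy_model.intercept_, toy_model.coef_)
```

With real, noisy data the recovered values are estimates rather than exact, which is why the error metrics later in the notebook matter.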

In [23]:
# Apply the model to the training data and measure the train MAE:
y_pred_train = model.predict(X_train)
mean_absolute_error(y_train, y_pred_train)

817.986916488347

In [24]:
# Apply the model to the held-out June test data and measure the test MAE:
y_pred = model.predict(X_test)
mean_absolute_error(y_test, y_pred)

825.5067594878676

In [25]:
# Get the fitted intercept and coefficients:
model.intercept_, model.coef_

(1425.9191833262303, array([1133.40460069, -767.81330737]))

In [26]:
beta0 = model.intercept_
beta1, beta2 = model.coef_
print(f'y = {beta0:.2f} + {beta1:.2f}x1 + {beta2:.2f}x2')

y = 1425.92 + 1133.40x1 + -767.81x2


In [27]:
# As shown in lecture, labeling each coefficient with its feature name makes the output easier to read:
print('Intercept', model.intercept_)
coefficients = pd.Series(model.coef_, features)
print(coefficients.to_string())

Intercept 1425.9191833262303
bedbath          1133.404601
bedbath_ratio    -767.813307
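
To make the fitted equation concrete, the sketch below plugs in the listing sampled earlier (3 bedrooms, 1 bathroom, so `bedbath` = 4.0 and `bedbath_ratio` = 3.0, listed at 3350). The numbers are copied from the outputs above; `predict_rent` is a hypothetical helper, not part of scikit-learn.

```python
# Values copied from the model.intercept_ and model.coef_ output above.
intercept = 1425.9191833262303
coef_bedbath = 1133.40460069
coef_ratio = -767.81330737

def predict_rent(bedbath, bedbath_ratio):
    """Evaluate y = beta0 + beta1 * bedbath + beta2 * bedbath_ratio by hand."""
    return intercept + coef_bedbath * bedbath + coef_ratio * bedbath_ratio

predicted = predict_rent(bedbath=4.0, bedbath_ratio=3.0)
residual = 3350 - predicted   # actual price minus predicted price
print(f'predicted: {predicted:.2f}, residual: {residual:.2f}')
```

The prediction comes out at about 3656, roughly 306 above the listed price. Holding the ratio fixed, each extra room adds about 1133 to the predicted rent; holding total rooms fixed, each unit increase in the bed-to-bath ratio lowers it by about 768.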


In [28]:
# Import the metric functions needed for RMSE, MAE & R2:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

In [32]:
# MSE test:
mse = mean_squared_error(y_test, y_pred)
mse

1501420.9837258141

In [33]:
# MAE test:
new_mae = mean_absolute_error(y_test, y_pred)
new_mae

825.5067594878676

In [34]:
# RMSE test:
rmse = np.sqrt(mse)
rmse

1225.324848244666

In [35]:
# R2 test:
r2 = r2_score(y_test, y_pred)
r2

0.5193267994895671
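
The assignment asks for RMSE, MAE, and $R^2$ on both train and test data, while the cells above report only the test values. One way to cover both is a small helper like the sketch below, which spells out each definition in NumPy; `regression_metrics` is a hypothetical name, and the toy numbers are made up. With default settings, scikit-learn's `mean_squared_error`, `mean_absolute_error`, and `r2_score` compute the same quantities.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return RMSE, MAE, and R^2 computed from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    mse = np.mean(errors ** 2)
    ss_res = np.sum(errors ** 2)                      # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
    return {
        'rmse': float(np.sqrt(mse)),
        'mae': float(np.mean(np.abs(errors))),
        'r2': float(1 - ss_res / ss_tot),
    }

# Toy check with values small enough to verify by hand.
toy_metrics = regression_metrics([3, 5, 7, 9], [2, 5, 8, 9])
print(toy_metrics)
```

In the notebook, calling it as `regression_metrics(y_train, y_pred_train)` and `regression_metrics(y_test, y_pred)` would give the train and test metrics side by side.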