Lambda School Data Science

*Unit 2, Sprint 1, Module 2*

---

# Regression 2

## Assignment

You'll continue to **predict how much it costs to rent an apartment in NYC,** using the dataset from renthop.com.

- [ ] Do train/test split. Use data from April & May 2016 to train. Use data from June 2016 to test.
- [ ] Engineer at least two new features. (See below for explanation & ideas.)
- [ ] Fit a linear regression model with at least two features.
- [ ] Get the model's coefficients and intercept.
- [ ] Get regression metrics RMSE, MAE, and $R^2$, for both the train and test data.
- [ ] What's the best test MAE you can get? Share your score and features used with your cohort on Slack!
- [ ] As always, commit your notebook to your fork of the GitHub repo.


#### [Feature Engineering](https://en.wikipedia.org/wiki/Feature_engineering)

> "Some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used." — Pedro Domingos, ["A Few Useful Things to Know about Machine Learning"](https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf)

> "Coming up with features is difficult, time-consuming, requires expert knowledge. 'Applied machine learning' is basically feature engineering." — Andrew Ng, [Machine Learning and AI via Brain simulations](https://forum.stanford.edu/events/2011/2011slides/plenary/2011plenaryNg.pdf) 

> Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. 

#### Feature Ideas
- Does the apartment have a description?
- How long is the description?
- How many total perks does each apartment have?
- Are cats _or_ dogs allowed?
- Are cats _and_ dogs allowed?
- Total number of rooms (beds + baths)
- Ratio of beds to baths
- What's the neighborhood, based on address or latitude & longitude?
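
A few of the ideas above can be sketched on a toy DataFrame. This is only an illustration: the column names mirror the renthop data, but the rows are made up, and the new column names (`has_description`, `description_length`, `cats_or_dogs`, `cats_and_dogs`, `total_rooms`) are hypothetical choices rather than part of the assignment.

```python
import pandas as pd

# Toy listings; the values are invented for illustration only.
listings = pd.DataFrame({
    'description': ['Sunny loft near the park', None, '   '],
    'bedrooms': [2, 1, 0],
    'bathrooms': [1.0, 1.0, 1.0],
    'cats_allowed': [1, 0, 1],
    'dogs_allowed': [0, 0, 1],
})

# Treat missing or whitespace-only descriptions as "no description".
clean_description = listings['description'].fillna('').str.strip()
listings['has_description'] = (clean_description != '').astype(int)
listings['description_length'] = clean_description.str.len()

# Pet policy: "or" versus "and".
cats = listings['cats_allowed'] == 1
dogs = listings['dogs_allowed'] == 1
listings['cats_or_dogs'] = (cats | dogs).astype(int)
listings['cats_and_dogs'] = (cats & dogs).astype(int)

# Total number of rooms (beds + baths).
listings['total_rooms'] = listings['bedrooms'] + listings['bathrooms']

print(listings.drop(columns='description'))
```

A ratio of beds to baths would need extra care for listings with zero bathrooms, since the division would produce infinite or missing values.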

## Stretch Goals
- [ ] If you want more math, skim [_An Introduction to Statistical Learning_](http://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf),  Chapter 3.1, Simple Linear Regression, & Chapter 3.2, Multiple Linear Regression
- [ ] If you want more introduction, watch [Brandon Foltz, Statistics 101: Simple Linear Regression](https://www.youtube.com/watch?v=ZkjP5RJLQF4) (20 minutes, over 1 million views)
- [ ] Add your own stretch goal(s)!

In [0]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Applied-Modeling/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'
    
# Ignore this Numpy warning when using Plotly Express:
# FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning, module='numpy')

In [0]:
import numpy as np
import pandas as pd

# Read New York City apartment rental listing data
df = pd.read_csv(DATA_PATH+'apartments/renthop-nyc.csv')
assert df.shape == (49352, 34)

# Remove the most extreme 1% prices,
# the most extreme .1% latitudes, &
# the most extreme .1% longitudes
df = df[(df['price'] >= np.percentile(df['price'], 0.5)) & 
        (df['price'] <= np.percentile(df['price'], 99.5)) & 
        (df['latitude'] >= np.percentile(df['latitude'], 0.05)) & 
        (df['latitude'] < np.percentile(df['latitude'], 99.95)) &
        (df['longitude'] >= np.percentile(df['longitude'], 0.05)) & 
        (df['longitude'] <= np.percentile(df['longitude'], 99.95))]
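
The percentile filter above can be hard to picture, so here is a minimal sketch of the same idea on a made-up series of 1,000 prices. It assumes `Series.between`, which is inclusive on both ends by default, as a shorter way to write the paired `>=` and `<=` comparisons.

```python
import numpy as np
import pandas as pd

# Made-up prices 1..1000, standing in for the real 'price' column.
prices = pd.Series(np.arange(1, 1001))

# Cut points for the most extreme 1% (0.5% from each tail).
low, high = np.percentile(prices, [0.5, 99.5])

# Keep only the values between the two cut points (inclusive).
trimmed = prices[prices.between(low, high)]

print(len(prices), '->', len(trimmed))
```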

In [0]:
df.head(3)

Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space
0,1.5,3,2016-06-24 07:54:24,A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ...,Metropolitan Avenue,40.7145,-73.9425,3000,792 Metropolitan Avenue,medium,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,1.0,2,2016-06-12 12:19:27,,Columbus Avenue,40.7947,-73.9667,5465,808 Columbus Avenue,low,1,1,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,1.0,1,2016-04-17 03:26:41,"Top Top West Village location, beautiful Pre-w...",W 13 Street,40.7388,-74.0018,2850,241 W 13 Street,high,0,0,1,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [0]:
df.dtypes

bathrooms               float64
bedrooms                  int64
created                  object
description              object
display_address          object
latitude                float64
longitude               float64
price                     int64
street_address           object
interest_level           object
elevator                  int64
cats_allowed              int64
hardwood_floors           int64
dogs_allowed              int64
doorman                   int64
dishwasher                int64
no_fee                    int64
laundry_in_building       int64
fitness_center            int64
pre-war                   int64
laundry_in_unit           int64
roof_deck                 int64
outdoor_space             int64
dining_room               int64
high_speed_internet       int64
balcony                   int64
swimming_pool             int64
new_construction          int64
terrace                   int64
exclusive                 int64
loft                      int64
garden_p

In [0]:
# 2. Engineer at least two new features. (See below for explanation & ideas.)

In [0]:
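# Encode interest level as an ordinal number; assign the result back instead of relying on inplace=True
df["interest_level"] = df["interest_level"].map({"high": 3, "medium": 2, "low": 1})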

In [0]:

# A feature that adds bedrooms, bathrooms, and most of the perks into a single column value
perk_columns = ['bedrooms', 'bathrooms', 'laundry_in_building', 'swimming_pool',
                'fitness_center', 'balcony', 'loft', 'garden_patio',
                'high_speed_internet', 'outdoor_space', 'exclusive', 'dishwasher']
df['perk_sum'] = df[perk_columns].sum(axis=1)
# A feature that adds the numeric interest level to the number of bedrooms and bathrooms
df['bedbath_interest'] = df["interest_level"] + df['bedrooms'] + df['bathrooms']

In [0]:
# 1. Do train/test split. Use data from April & May 2016 to train. Use data from June 2016 to test.

    # A. The "created" column is stored as strings (object dtype), so convert it to datetime before filtering by date range.

In [0]:
# Make a new column of "created" in a date time format:

df['created_in_dt'] =  pd.to_datetime(df['created'])

In [0]:
# Pull the specific dates for train data:

train = df[(df['created_in_dt'] >= '2016-04-01') & (df['created_in_dt'] < '2016-06-01')]
train = train.sort_values('created_in_dt')

print ("From this date:" , train['created_in_dt'].min() ,"------------- to this date:" , train['created_in_dt'].max())

From this date: 2016-04-01 22:12:41 ------------- to this date: 2016-05-31 23:10:48


In [0]:
# Pull the specific dates for test data:

test = df[(df['created_in_dt'] >= '2016-06-01') & (df['created_in_dt'] < '2016-07-01')]
test = test.sort_values('created_in_dt')

print ("From this date:" , test['created_in_dt'].min() ,"------------- to this date:" , test['created_in_dt'].max())

From this date: 2016-06-01 01:10:37 ------------- to this date: 2016-06-29 21:41:47
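
The time-based split above can also be written with the month accessor. The sketch below uses a four-row toy frame: the timestamps are the minimum and maximum dates printed above, while the prices are placeholders. It assumes every row falls in 2016, so the month alone identifies the split.

```python
import pandas as pd

# Toy listings; timestamps taken from the printed date ranges, prices are placeholders.
listings = pd.DataFrame({
    'created': ['2016-04-01 22:12:41', '2016-05-31 23:10:48',
                '2016-06-01 01:10:37', '2016-06-29 21:41:47'],
    'price': [3000, 2850, 5465, 3100],
})

# Parse the strings into datetimes so the .dt accessor is available.
listings['created'] = pd.to_datetime(listings['created'])
month = listings['created'].dt.month

# April & May for training, June for testing.
train = listings[month.isin([4, 5])]
test = listings[month == 6]

print(len(train), 'train rows,', len(test), 'test rows')
```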


In [0]:
# Importing necessary modules

import plotly.express as px
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


In [0]:
# Instantiating class

model = LinearRegression()

In [0]:
# Assigning features and target

features = ['perk_sum']
target = ['price']

In [0]:
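# Assigning X and y train variables.
# Caution: df still contains the June rows, so this fits on all three months;
# selecting from `train` instead would keep the June test data held out.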

x_train = df[features]
y_train = df[target]

In [0]:
# Linear regression with train data

model.fit(x_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [0]:
# Predicted change in monthly rent per one-point increase in "perk_sum":

model.coef_

array([[501.90404122]])

In [0]:
# Predicted rent for a "perk_sum" value of 0 (the intercept, not a minimum):

model.intercept_

array([1601.14751724])
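
A linear regression prediction is the intercept plus each coefficient times its feature value. The sketch below shows this with two features, as the assignment asks for, on a made-up dataset where the true relationship is known. The `perk_count` column and all of the numbers are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up data where price is exactly 1000 + 500 * bedrooms + 200 * perk_count.
toy = pd.DataFrame({
    'bedrooms': [0, 1, 2, 3, 1, 2],
    'perk_count': [1, 0, 3, 2, 4, 1],
})
toy['price'] = 1000 + 500 * toy['bedrooms'] + 200 * toy['perk_count']

features = ['bedrooms', 'perk_count']
model = LinearRegression()
model.fit(toy[features], toy['price'])

# Pair each coefficient with its feature name for readability.
coefficients = pd.Series(model.coef_, index=features)
print(coefficients)
print('Intercept:', model.intercept_)

# Reproduce one prediction by hand: intercept + sum(coefficient * value).
new_listing = pd.DataFrame({'bedrooms': [2], 'perk_count': [5]})
manual = model.intercept_ + (coefficients * new_listing.iloc[0]).sum()
predicted = model.predict(new_listing)[0]
print(manual, predicted)
```

Passing a one-dimensional target (`toy['price']` rather than `toy[['price']]`) keeps `coef_` one-dimensional, which makes it easier to label.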

In [0]:
# Predict the rent for a single hypothetical "perk_sum" value

amenity_predict = 19

x_test = [[amenity_predict]]

y_pred = model.predict(x_test)

print("If the 'perk_sum' value is 19 then the monthly rent is predicted to be ",y_pred)


If the 'perk_sum' value is 19 then the monthly rent is predicted to be  [[11137.32430033]]


In [0]:
# Mean absolute error of the mean baseline (not of the fitted model)

# Get the baseline: always guess the mean price

guess = y_train.mean()

y_pred = [guess] * len(y_train)

mae = mean_absolute_error(y_train, y_pred)

print ("Baseline mean absolute error:", mae)


Baseline mean absolute error: 1201.532252154329


In [0]:
# y_pred still holds the mean baseline, so this is the baseline's MSE
mse = mean_squared_error(y_train, y_pred)

print ("Baseline mean squared error:", mse)

Baseline mean squared error: 3106098.59781418


In [0]:
rmse = np.sqrt(mse)

print ("Baseline root mean squared error:", rmse)

Baseline root mean squared error: 1762.4127206231178


In [0]:
y = df[target]

In [0]:
# y_pred still holds the mean baseline, so this is the baseline's R^2, which is 0 by definition
r2 = r2_score(y, y_pred)

print ("R^2:", r2)

R^2: 0.0
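
An R^2 of exactly 0 is what a mean baseline always scores on the data its mean came from, so it says nothing about the fitted model. The sketch below, on made-up numbers, compares the baseline with the model's own predictions from `model.predict`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score

# Made-up feature values and rents, for illustration only.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([1500.0, 2100.0, 2400.0, 3100.0, 3400.0])

# Baseline: predict the mean rent for every row.
baseline_pred = np.full_like(y, y.mean())

# Model: predictions come from the fitted regression, not from the mean.
model = LinearRegression().fit(X, y)
model_pred = model.predict(X)

baseline_mae = mean_absolute_error(y, baseline_pred)
model_mae = mean_absolute_error(y, model_pred)
baseline_r2 = r2_score(y, baseline_pred)
model_r2 = r2_score(y, model_pred)

print('Baseline MAE:', baseline_mae, 'R^2:', baseline_r2)
print('Model MAE:', model_mae, 'R^2:', model_r2)
```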


In [0]:
# Test Data ---

In [0]:
x_test_data = test[features]
y_test_data = test[target]

In [0]:
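# Caution: this refits the model on the June test data and overwrites the coefficients
# learned earlier, so the values below describe the test set rather than a held-out evaluation.
model.fit(x_test_data, y_test_data)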

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [0]:
model.coef_

array([[513.26225222]])

In [0]:
model.intercept_

array([1572.27429167])

In [0]:
amenity_predict = 19

x_test_data_test = [[amenity_predict]]

y_test_data_predict = model.predict(x_test_data_test)

print("If the 'perk_sum' value is 19 then the monthly rent is predicted to be ",y_test_data_predict)

If the 'perk_sum' value is 19 then the monthly rent is predicted to be  [[11324.25708384]]


In [0]:
# Baseline for the test data: always guess the mean price
guess = y_test_data.mean()

y_test_data_predict = [guess] * len(y_test_data)

# Score against the target (y_test_data), not the features (x_test_data)
mae2 = mean_absolute_error(y_test_data, y_test_data_predict)

print ("Baseline mean absolute error:", mae2)


In [0]:
# Score against the target (y_test_data), not the features (x_test_data)
mse2 = mean_squared_error(y_test_data, y_test_data_predict)

print ("Baseline mean squared error:", mse2)


In [0]:
rmse2 = np.sqrt(mse2)

print ("Baseline root mean squared error:", rmse2)


In [0]:
y2 = test[target]

In [0]:
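# y_test_data_predict holds the mean baseline, so this is the baseline's R^2, which is 0 by definition
r2test = r2_score(y2, y_test_data_predict)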

print ("R^2:", r2test)

R^2: 0.0
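
For the assignment's "metrics for both the train and test data" item, one possible workflow is to fit once on the training rows, then score the model's predictions on each split. The sketch below uses synthetic data in place of the renthop listings; the helper name `regression_metrics`, the random data, and the 200/100 split are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE, and R^2 for one set of predictions (hypothetical helper)."""
    mse = mean_squared_error(y_true, y_pred)
    return {
        'MAE': mean_absolute_error(y_true, y_pred),
        'RMSE': float(np.sqrt(mse)),
        'R2': r2_score(y_true, y_pred),
    }


# Synthetic stand-in for one feature and a rent that depends on it, plus noise.
rng = np.random.default_rng(0)
X = rng.integers(0, 12, size=(300, 1)).astype(float)
y = 1600 + 500 * X[:, 0] + rng.normal(0, 400, size=300)

# Earlier rows play the role of April & May, later rows the role of June.
X_train, X_test = X[:200], X[200:]
y_train, y_test = y[:200], y[200:]

# Fit on the training rows only; the test rows are never passed to fit().
model = LinearRegression()
model.fit(X_train, y_train)

train_metrics = regression_metrics(y_train, model.predict(X_train))
test_metrics = regression_metrics(y_test, model.predict(X_test))

# Baseline for comparison: guess the training mean for every test row.
baseline_test = regression_metrics(y_test, np.full_like(y_test, y_train.mean()))

print('Train:', train_metrics)
print('Test:', test_metrics)
print('Baseline (test):', baseline_test)
```

Using the training mean for the test baseline keeps the test data fully held out, so the comparison against the model is like for like.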
