<a href="https://colab.research.google.com/github/JeffreyAsuncion/DS-Unit-2-Linear-Models/blob/master/DSPT6_U2S1M2_JeffreyAsuncion_assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science

*Unit 2, Sprint 1, Module 2*

---

# Regression 2

## Assignment

You'll continue to **predict how much it costs to rent an apartment in NYC,** using the dataset from renthop.com.

- [x] Do train/test split. Use data from April & May 2016 to train. Use data from June 2016 to test.
- [x] Engineer at least two new features. (See below for explanation & ideas.)
- [x] Fit a linear regression model with at least two features.
- [x] Get the model's coefficients and intercept.
- [x] Get regression metrics RMSE, MAE, and $R^2$, for both the train and test data.
- [ ] What's the best test MAE you can get? Share your score and features used with your cohort on Slack!
- [ ] As always, commit your notebook to your fork of the GitHub repo.


#### [Feature Engineering](https://en.wikipedia.org/wiki/Feature_engineering)

> "Some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used." — Pedro Domingos, ["A Few Useful Things to Know about Machine Learning"](https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf)

> "Coming up with features is difficult, time-consuming, requires expert knowledge. 'Applied machine learning' is basically feature engineering." — Andrew Ng, [Machine Learning and AI via Brain simulations](https://forum.stanford.edu/events/2011/2011slides/plenary/2011plenaryNg.pdf) 

> Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. 

#### Feature Ideas
- Does the apartment have a description?
- How long is the description?
- How many total perks does each apartment have?
- Are cats _or_ dogs allowed?
- Are cats _and_ dogs allowed?
- Total number of rooms (beds + baths)
- Ratio of beds to baths
- What's the neighborhood, based on address or latitude & longitude?

## Stretch Goals
- [ ] If you want more math, skim [_An Introduction to Statistical Learning_](http://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf),  Chapter 3.1, Simple Linear Regression, & Chapter 3.2, Multiple Linear Regression
- [ ] If you want more introduction, watch [Brandon Foltz, Statistics 101: Simple Linear Regression](https://www.youtube.com/watch?v=ZkjP5RJLQF4)
(20 minutes, over 1 million views)
- [ ] Add your own stretch goal(s) !

In [0]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Applied-Modeling/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'
    
# Ignore this Numpy warning when using Plotly Express:
# FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning, module='numpy')

In [0]:
import numpy as np
import pandas as pd

# Read New York City apartment rental listing data
df = pd.read_csv(DATA_PATH+'apartments/renthop-nyc.csv')
assert df.shape == (49352, 34)

# Remove the most extreme 1% prices,
# the most extreme .1% latitudes, &
# the most extreme .1% longitudes
df = df[(df['price'] >= np.percentile(df['price'], 0.5)) & 
        (df['price'] <= np.percentile(df['price'], 99.5)) & 
        (df['latitude'] >= np.percentile(df['latitude'], 0.05)) & 
        (df['latitude'] < np.percentile(df['latitude'], 99.95)) &
        (df['longitude'] >= np.percentile(df['longitude'], 0.05)) & 
        (df['longitude'] <= np.percentile(df['longitude'], 99.95))]

In [3]:
df.shape

(48817, 34)

In [4]:
df.columns

Index(['bathrooms', 'bedrooms', 'created', 'description', 'display_address',
       'latitude', 'longitude', 'price', 'street_address', 'interest_level',
       'elevator', 'cats_allowed', 'hardwood_floors', 'dogs_allowed',
       'doorman', 'dishwasher', 'no_fee', 'laundry_in_building',
       'fitness_center', 'pre-war', 'laundry_in_unit', 'roof_deck',
       'outdoor_space', 'dining_room', 'high_speed_internet', 'balcony',
       'swimming_pool', 'new_construction', 'terrace', 'exclusive', 'loft',
       'garden_patio', 'wheelchair_access', 'common_outdoor_space'],
      dtype='object')

In [5]:
df.head(2)

Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space
0,1.5,3,2016-06-24 07:54:24,A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ...,Metropolitan Avenue,40.7145,-73.9425,3000,792 Metropolitan Avenue,medium,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,1.0,2,2016-06-12 12:19:27,,Columbus Avenue,40.7947,-73.9667,5465,808 Columbus Avenue,low,1,1,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [6]:
# Inspect the 'created' column (timestamps stored as strings)
df['created'].describe()

count                   48817
unique                  48148
top       2016-05-17 14:29:08
freq                        3
Name: created, dtype: object

Split the data into two sets
====================

- Train data from April & May 2016 
- Test data from June 2016 

In [7]:
# Split the data into two sets

train = df[df['created'] < '2016-06-01']
test = df[df['created'] >= '2016-06-01']
train.shape, test.shape

((31844, 34), (16973, 34))
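
The string comparison above works because `created` holds `YYYY-MM-DD HH:MM:SS` text, which sorts in chronological order. A minimal sketch of the same split on parsed timestamps, using a small made-up frame (`toy`, `toy_train`, `toy_test`, and `cutoff` are hypothetical names, not part of the assignment):

```python
import pandas as pd

# Hypothetical toy data standing in for the renthop listings
toy = pd.DataFrame({
    'created': ['2016-04-15 10:00:00', '2016-05-31 23:59:59',
                '2016-06-01 00:00:00', '2016-06-24 07:54:24'],
    'price': [3000, 2500, 4100, 3300],
})

# Parse the strings so comparisons are made on real timestamps
toy['created'] = pd.to_datetime(toy['created'])

cutoff = pd.Timestamp('2016-06-01')
toy_train = toy[toy['created'] < cutoff]    # April & May
toy_test = toy[toy['created'] >= cutoff]    # June

print(toy_train.shape, toy_test.shape)
```

Parsing first also makes it possible to derive date-based features later, for example through the `.dt` accessor.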

Create 2 New Features
==========

- Does the apartment have a description?
- How long is the description?
- How many total perks does each apartment have?
- Are cats _or_ dogs allowed?
- Are cats _and_ dogs allowed?
- Total number of rooms (beds + baths)
- Ratio of beds to baths
- What's the neighborhood, based on address or latitude & longitude?
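
A few of the ideas in this list can be sketched in a handful of lines. The example below uses a made-up three-row frame (`toy` is hypothetical). The new column names (`has_description`, `description_length`, `cats_or_dogs`, `cats_and_dogs`) are illustrative choices, and treating whitespace-only descriptions as empty is an assumption.

```python
import pandas as pd

# Hypothetical toy listings; column names follow the renthop dataset
toy = pd.DataFrame({
    'description': ['Sunny 2BR near the park', None, '   '],
    'cats_allowed': [1, 0, 1],
    'dogs_allowed': [0, 0, 1],
})

# Assumption: missing or whitespace-only descriptions count as empty
desc = toy['description'].fillna('').str.strip()

toy['has_description'] = (desc != '').astype(int)
toy['description_length'] = desc.str.len()

# Bitwise | and & work element-wise on the 0/1 flag columns
toy['cats_or_dogs'] = toy['cats_allowed'] | toy['dogs_allowed']
toy['cats_and_dogs'] = toy['cats_allowed'] & toy['dogs_allowed']

print(toy)
```

The same transformations would need to be applied to both `train` and `test` so the two frames keep identical columns.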

In [0]:
# Copy the slices to avoid SettingWithCopyWarning when adding columns
train = train.copy()
test = test.copy()

In [9]:
train.columns

Index(['bathrooms', 'bedrooms', 'created', 'description', 'display_address',
       'latitude', 'longitude', 'price', 'street_address', 'interest_level',
       'elevator', 'cats_allowed', 'hardwood_floors', 'dogs_allowed',
       'doorman', 'dishwasher', 'no_fee', 'laundry_in_building',
       'fitness_center', 'pre-war', 'laundry_in_unit', 'roof_deck',
       'outdoor_space', 'dining_room', 'high_speed_internet', 'balcony',
       'swimming_pool', 'new_construction', 'terrace', 'exclusive', 'loft',
       'garden_patio', 'wheelchair_access', 'common_outdoor_space'],
      dtype='object')

Total number of rooms (beds + baths)

In [0]:
# Total number of rooms (beds + baths)
train['total_num_of_rooms'] = train['bedrooms'] + train['bathrooms']
test['total_num_of_rooms'] = test['bedrooms'] + test['bathrooms']

In [11]:
train[['total_num_of_rooms', 'bedrooms', 'bathrooms']].head(2)

Unnamed: 0,total_num_of_rooms,bedrooms,bathrooms
2,2.0,1,1.0
3,2.0,1,1.0


In [12]:
test[['total_num_of_rooms', 'bedrooms', 'bathrooms']].head(2)

Unnamed: 0,total_num_of_rooms,bedrooms,bathrooms
0,4.5,3,1.5
1,3.0,2,1.0


Ratio of beds to baths

In [0]:
# Ratio of beds to baths
train['ratio_bed_to_baths'] = train['bedrooms'] / train['bathrooms']
test['ratio_bed_to_baths'] = test['bedrooms'] / test['bathrooms']

In [14]:
# Check the new 'ratio_bed_to_baths' column
train[['ratio_bed_to_baths', 'bedrooms', 'bathrooms']].head(2)

Unnamed: 0,ratio_bed_to_baths,bedrooms,bathrooms
2,1.0,1,1.0
3,1.0,1,1.0


In [15]:
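train['ratio_bed_to_baths'].value_counts()
# inf (infinity) appears where bathrooms == 0 (division by zero).
# A linear regression cannot be fit on inf values, so this feature needs cleaning before use.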

1.000000    12495
2.000000     7892
0.000000     6073
3.000000     2288
1.500000     1801
4.000000      239
1.333333      216
0.500000      133
0.666667      116
1.200000       95
inf            92
2.500000       76
0.800000       59
2.666667       40
1.250000       24
1.600000       23
1.666667       22
1.142857       15
0.857143       12
0.333333        8
5.000000        7
0.750000        6
0.888889        4
3.333333        3
0.200000        1
0.222222        1
2.333333        1
2.400000        1
6.000000        1
1.428571        1
Name: ratio_bed_to_baths, dtype: int64
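
The 92 `inf` rows come from dividing by zero bathrooms. Any listing with zero bedrooms and zero bathrooms would produce `NaN` instead, which `value_counts()` leaves out by default. One possible cleanup is sketched below on made-up data (`toy` and `raw` are hypothetical). Filling non-finite ratios with 0 is an assumption, not the only reasonable choice.

```python
import numpy as np
import pandas as pd

# Hypothetical toy listings, including zero-bathroom cases
toy = pd.DataFrame({'bedrooms': [2, 1, 0, 3],
                    'bathrooms': [1.0, 0.0, 0.0, 1.5]})

# Plain division yields inf for 1/0 and NaN for 0/0
raw = toy['bedrooms'] / toy['bathrooms']

# Assumption: a listing with zero bathrooms gets a ratio of 0
toy['ratio_bed_to_baths'] = raw.replace([np.inf, -np.inf], np.nan).fillna(0)

print(toy)
```

With the non-finite values replaced, the column can be passed to `LinearRegression`, which otherwise raises an error on `inf` or `NaN` inputs.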

How many total perks does each apartment have?

In [16]:
train.columns

Index(['bathrooms', 'bedrooms', 'created', 'description', 'display_address',
       'latitude', 'longitude', 'price', 'street_address', 'interest_level',
       'elevator', 'cats_allowed', 'hardwood_floors', 'dogs_allowed',
       'doorman', 'dishwasher', 'no_fee', 'laundry_in_building',
       'fitness_center', 'pre-war', 'laundry_in_unit', 'roof_deck',
       'outdoor_space', 'dining_room', 'high_speed_internet', 'balcony',
       'swimming_pool', 'new_construction', 'terrace', 'exclusive', 'loft',
       'garden_patio', 'wheelchair_access', 'common_outdoor_space',
       'total_num_of_rooms', 'ratio_bed_to_baths'],
      dtype='object')

In [0]:
perks =['elevator', 'cats_allowed', 'hardwood_floors', 'dogs_allowed',
       'doorman', 'dishwasher', 'no_fee', 'laundry_in_building',
       'fitness_center', 'pre-war', 'laundry_in_unit', 'roof_deck',
       'outdoor_space', 'dining_room', 'high_speed_internet', 'balcony',
       'swimming_pool', 'new_construction', 'terrace', 'exclusive', 'loft',
       'garden_patio', 'wheelchair_access', 'common_outdoor_space']

In [18]:
train[perks].sample(5)

Unnamed: 0,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space
1125,0,0,1,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0
48705,0,0,1,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0
7347,1,1,0,1,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
41262,0,0,1,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
28108,0,1,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [19]:
# create new feature 'total_perks'
train['total_perks'] = train[perks].sum(axis=1)
test['total_perks'] = test[perks].sum(axis=1)
train[['total_perks']].sample(5) 

Unnamed: 0,total_perks
40466,12
28594,0
42450,5
594,9
42176,9


## What's the neighborhood, based on address or latitude & longitude?

In [20]:
# How could a neighborhood feature be derived?
# Start by looking at the columns we have.
train.columns

Index(['bathrooms', 'bedrooms', 'created', 'description', 'display_address',
       'latitude', 'longitude', 'price', 'street_address', 'interest_level',
       'elevator', 'cats_allowed', 'hardwood_floors', 'dogs_allowed',
       'doorman', 'dishwasher', 'no_fee', 'laundry_in_building',
       'fitness_center', 'pre-war', 'laundry_in_unit', 'roof_deck',
       'outdoor_space', 'dining_room', 'high_speed_internet', 'balcony',
       'swimming_pool', 'new_construction', 'terrace', 'exclusive', 'loft',
       'garden_patio', 'wheelchair_access', 'common_outdoor_space',
       'total_num_of_rooms', 'ratio_bed_to_baths', 'total_perks'],
      dtype='object')

In [21]:
train[['display_address','latitude', 'longitude','street_address']].sample(5)

Unnamed: 0,display_address,latitude,longitude,street_address
30860,West 102nd Street,40.7981,-73.9685,214 West 102nd Street
38259,Broadway,40.7108,-73.9672,53 Broadway
34944,West 34th Street,40.753,-73.9958,360 West 34th Street
44683,8th Ave.,40.7637,-73.9846,888 8th Ave.
37681,West 34th Street,40.7494,-73.9875,50 West 34th Street


In [22]:
# Count the non-missing latitude values
train['latitude'].value_counts().sum()
# What if latitude and longitude go into the linear regression as-is,
# without any feature engineering? It sounds ludicrous, but let's try it.

31844
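
One possible answer to the neighborhood question is to cluster the coordinates and use the cluster label as a stand-in for neighborhood. This is a sketch of that idea on made-up points, not the method used in this notebook. `toy_train`, `toy_test`, the `neighborhood` column, and the choice of two clusters are all assumptions.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical coordinates: two loose groups of listings
toy_train = pd.DataFrame({
    'latitude':  [40.71, 40.72, 40.71, 40.80, 40.79, 40.80],
    'longitude': [-73.94, -73.95, -73.94, -73.97, -73.96, -73.97],
})
toy_test = pd.DataFrame({'latitude': [40.715, 40.795],
                         'longitude': [-73.945, -73.965]})

coords = ['latitude', 'longitude']

# Fit clusters on the training rows only, then label both sets
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
toy_train['neighborhood'] = kmeans.fit_predict(toy_train[coords])
toy_test['neighborhood'] = kmeans.predict(toy_test[coords])

print(toy_train)
print(toy_test)
```

The labels are arbitrary IDs rather than quantities, so they would need one-hot encoding (for example with `pd.get_dummies`) before going into a linear model.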

Fit a linear regression model with at least two features.
========


In [23]:
# 1. Begin with baseline (0 features)

train['price'].mean()

3575.604007034292

In [0]:
# 2. Arrange y target vectors
target = 'price'
y_train = train[target]
y_test = test[target] 

In [25]:
# 3. Get mean baseline
print('Mean Baseline (using 0 Features)')
guess = y_train.mean()
# Compute this mean from the train data only,
# so the baseline is not contaminated by the test data.

Mean Baseline (using 0 Features)


In [26]:
# Train Error
from sklearn.metrics import mean_absolute_error
y_pred = [guess] * len(y_train)
mae = mean_absolute_error(y_train, y_pred)
print(f'Train Error (April and May 2016) :  ${mae:.2f}')

Train Error (April and May 2016) :  $1201.88


In [28]:
# Test Error
y_pred = [guess] * len(y_test)
mae = mean_absolute_error(y_test, y_pred)
print(f'Test Error (June 2016) :  ${mae:.2f}')

Test Error (June 2016) :  $1197.71


In [0]:
# 1. Import the appropriate estimator class from scikit-learn
from sklearn.linear_model import LinearRegression

In [0]:
# 2. Instantiate this class
model = LinearRegression()

In [31]:
# 3. Arrange X features matrices (already did y target vectors)
# Use bedrooms, bathrooms, the engineered total_perks feature, latitude, and longitude
features = ['bedrooms', 'bathrooms', 'total_perks','latitude', 'longitude' ] #added lat and long
X_train = train[features]
X_test = test[features]
print(f'Linear Regression, dependent on: {features}')

Linear Regression, dependent on: ['bedrooms', 'bathrooms', 'total_perks', 'latitude', 'longitude']


In [36]:
# 4. Fit the model
model.fit(X_train, y_train)
y_pred_train = model.predict(X_train)
mae = mean_absolute_error(y_train, y_pred_train)
print(f'Train Error: ${mae:.2f}')

Train Error: $729.17


In [37]:
# 5. Apply the model to new data
y_pred_test = model.predict(X_test)
mae = mean_absolute_error(y_test, y_pred_test)
print(f'Test Error: ${mae:.2f}')

Test Error: $733.71


Get the model's coefficients and intercept.
========


In [35]:
model.intercept_, model.coef_

(-1177127.8270060434,
 array([   426.65490989,   1927.55430941,     46.52788523,   1825.66915686,
        -14912.42176326]))

In [41]:
beta0 = model.intercept_
beta1, beta2, beta3, beta4, beta5 = model.coef_
print(f'intercept : {beta0:.2f}')
print(f'beta1 : {beta1:.2f}')
print(f'beta2 : {beta2:.2f}')
print(f'beta3 : {beta3:.2f}')
print(f'beta4 : {beta4:.2f}')
print(f'beta5 : {beta5:.2f}\n')
print(f'y = {beta0:.2f} + {beta1:.2f}*x1 + {beta2:.2f}*x2 + {beta3:.2f}*x3 + {beta4:.2f}*x4 + {beta5:.2f}*x5\n')
print(f'price = {beta0:.2f} + {beta1:.2f}*num_of_bedrooms + {beta2:.2f}*num_of_bathrooms + {beta3:.2f}*total_perks + {beta4:.2f}*lat + {beta5:.2f}*long')

intercept : -1177127.83
beta1 : 426.65
beta2 : 1927.55
beta3 : 46.53
beta4 : 1825.67
beta5 : -14912.42

y = -1177127.83 + 426.65*x1 + 1927.55*x2 + 46.53*x3 + 1825.67*x4 + -14912.42*x5

price = -1177127.83 + 426.65*num_of_bedrooms + 1927.55*num_of_bathrooms + 46.53*total_perks + 1825.67*lat + -14912.42*long
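
Unpacking `model.coef_` into numbered betas works, but it is easy to lose track of which coefficient belongs to which column. A minimal sketch of labeling coefficients by feature name follows. The data is made up, with prices built from a known formula so the fit is exact, and `toy_model` and `coefs` are hypothetical names.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data; price is built as 1000 + 500*bedrooms + 300*bathrooms
X = pd.DataFrame({'bedrooms': [1, 2, 3, 2], 'bathrooms': [1, 1, 2, 2]})
y = 1000 + 500 * X['bedrooms'] + 300 * X['bathrooms']

toy_model = LinearRegression().fit(X, y)

# Label each coefficient with the column it belongs to
coefs = pd.Series(toy_model.coef_, index=X.columns)

print(f'intercept: {toy_model.intercept_:.2f}')
print(coefs.round(2))
```

The same pattern with this notebook's variables would be `pd.Series(model.coef_, index=features)`.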


Get regression metrics RMSE, MAE, and $R^2$, for both the train and test data.
======


In [0]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

In [43]:
print(f'y = {beta0:.2f} + {beta1:.2f}*x1 + {beta2:.2f}*x2 + {beta3:.2f}*x3 + {beta4:.2f}*x4 + {beta5:.2f}*x5\n')
print(f'price = {beta0:.2f} + {beta1:.2f}*num_of_bedrooms + {beta2:.2f}*num_of_bathrooms + {beta3:.2f}*total_perks + {beta4:.2f}*lat + {beta5:.2f}*long')

y = -1177127.83 + 426.65*x1 + 1927.55*x2 + 46.53*x3 + 1825.67*x4 + -14912.42*x5

price = -1177127.83 + 426.65*num_of_bedrooms + 1927.55*num_of_bathrooms + 46.53*total_perks + 1825.67*lat + -14912.42*long


In [0]:
# # Make Predictions
# # for train dataset
# x1_train = train['bedrooms']
# x2_train = train['bathrooms']
# x3_train = train['total_perks']
# x4_train = train['latitude']
# x5_train = train['longitude']
# y_train = train['price']
# y_pred_train = beta0 + beta1*x1_train + beta2*x2_train + beta3*x3_train + beta4*x4_train + beta5*x5_train

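# # Note: model.predict([x1_train, x2_train, x3_train, x4_train, x5_train]) raises an error
# # because predict expects a 2D input of shape (n_samples, n_features), while a list of
# # five Series has shape (5, n_samples). Use model.predict(train[features]) instead.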

# # for test dataset
# x1_test = test['bedrooms']
# x2_test = test['bathrooms']
# x3_test = test['total_perks']
# x4_test = test['latitude']
# x5_test = test['longitude']
# y_test = test['price']
# y_pred_test = beta0 + beta1*x1_test + beta2*x2_test + beta3*x3_test + beta4*x4_test + beta5*x5_test
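
The manual formula in the commented-out cell and `model.predict` should give the same numbers, as long as `predict` receives one row per listing. A small sketch on made-up data that checks this (`toy`, `toy_features`, and `toy_model` are hypothetical names):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data
toy = pd.DataFrame({
    'bedrooms':  [1, 2, 3, 2, 0],
    'bathrooms': [1.0, 1.0, 2.0, 2.0, 1.0],
    'price':     [2500, 3200, 4800, 4100, 2000],
})
toy_features = ['bedrooms', 'bathrooms']
toy_model = LinearRegression().fit(toy[toy_features], toy['price'])

# Manual prediction: intercept plus each coefficient times its column
b0 = toy_model.intercept_
b1, b2 = toy_model.coef_
manual = b0 + b1 * toy['bedrooms'] + b2 * toy['bathrooms']

# predict wants one row per listing: shape (n_samples, n_features)
auto = toy_model.predict(toy[toy_features])

print(np.allclose(manual, auto))
```

Passing the feature columns as a DataFrame slice keeps the rows-by-columns layout that scikit-learn expects.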

In [44]:
# Print Regression Metrics
mse_train = mean_squared_error(y_train, y_pred_train)
rmse_train = np.sqrt(mse_train)
mae_train = mean_absolute_error(y_train, y_pred_train)
r2_train = r2_score(y_train, y_pred_train)

mse_test = mean_squared_error(y_test, y_pred_test)
rmse_test = np.sqrt(mse_test)
mae_test = mean_absolute_error(y_test, y_pred_test)
r2_test = r2_score(y_test, y_pred_test)

print(f'Mean Squared Error \ttrain : {mse_train}\ttest : {mse_test}')
print(f'Root Mean Squared Error\ttrain : {rmse_train}\ttest : {rmse_test}')
print(f'Mean Absolute Error \ttrain : {mae_train}\ttest : {mae_test}')
print(f'R^2 \t\t\ttrain : {r2_train}\ttest : {r2_test}')

Mean Squared Error 	train : 1291105.5883488245	test : 1258411.9880555514
Root Mean Squared Error	train : 1136.2682730538702	test : 1121.7896362756928
Mean Absolute Error 	train : 729.1721673898433	test : 733.7071465521932
R^2 			train : 0.5841887745465741	test : 0.5951083085975691


## Before adding the latitude and longitude

Metric | Train | Test
--------------------|-----------------------------|-------------
Mean Squared Error | 1449490.55 | 1421073.10
Root Mean Squared Error | 1203.94 | 1192.08
Mean Absolute Error | 795.40 | 799.73
$R^2$ | 0.5331 | 0.5427

## After adding latitude and longitude

Metric | Train | Test
--------------------|-----------------------------|-------------
Mean Squared Error | 1291105.59 | 1258411.99
Root Mean Squared Error | 1136.27 | 1121.79
Mean Absolute Error | 729.17 | 733.71
$R^2$ | 0.5842 | 0.5951

<div align="center">

Taking the features (bedrooms, bathrooms, total_perks, latitude, longitude) into account gives a reduction in variance of about 60% (test).

Alternatively, we can say that these features explain about 60% (test) of the variance in price.

</div>


In [0]:
# There is a 53% (train) & 54% (test) reduction in variance
# when we take the following features into account:
# (bedrooms, bathrooms, total_perks)

# Alternatively, we can say that the features
# (bedrooms, bathrooms, total_perks)
# explain 53% (train) & 54% (test) of the variance in price.

# thanks StatQuest

This is a great way to explain $R^2$:
https://www.youtube.com/watch?v=nk2CQITm_eo&feature=youtu.be
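
The "reduction in variance" reading of $R^2$ can be checked by hand: compare the squared error around the mean baseline with the squared error around the model's predictions. A minimal sketch with made-up numbers (`y_true` and `y_hat` are hypothetical values, not results from this notebook):

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical actual and predicted prices
y_true = np.array([3000.0, 2500.0, 4100.0, 3300.0, 5200.0])
y_hat = np.array([3100.0, 2700.0, 3900.0, 3500.0, 4800.0])

# Squared error around the mean baseline vs. around the predictions
ss_total = np.sum((y_true - y_true.mean()) ** 2)
ss_residual = np.sum((y_true - y_hat) ** 2)

# Fraction of the baseline's squared error that the model removes
r2_manual = 1 - ss_residual / ss_total

print(r2_manual, r2_score(y_true, y_hat))
```

The two printed values should agree, since `r2_score` computes the same quantity.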

What's the best test MAE you can get? Share your score and features used with your cohort on Slack!
========



As always, commit your notebook to your fork of the GitHub repo.
=========
