
Lambda School Data Science

*Unit 2, Sprint 1, Module 2*

---

# Regression 2

## Assignment

You'll continue to **predict how much it costs to rent an apartment in NYC,** using the dataset from renthop.com.

- [ ] Do train/test split. Use data from April & May 2016 to train. Use data from June 2016 to test.
- [ ] Engineer at least two new features. (See below for explanation & ideas.)
- [ ] Fit a linear regression model with at least two features.
- [ ] Get the model's coefficients and intercept.
- [ ] Get regression metrics RMSE, MAE, and $R^2$, for both the train and test data.
- [ ] What's the best test MAE you can get? Share your score and features used with your cohort on Slack!
- [ ] As always, commit your notebook to your fork of the GitHub repo.
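The three metrics named in the checklist can be computed directly from their definitions. The sketch below uses made-up numbers rather than the renthop data, and the names `y_true` and `y_hat` are illustrative only.

```python
import numpy as np

# Toy values for illustration only; these are NOT from the renthop data.
y_true = np.array([3000.0, 2500.0, 4000.0, 3500.0])
y_hat = np.array([3100.0, 2400.0, 3800.0, 3700.0])

residuals = y_true - y_hat

# Mean Absolute Error: average size of the miss, in dollars.
mae = np.mean(np.abs(residuals))

# Root Mean Squared Error: penalizes large misses more, also in dollars.
rmse = np.sqrt(np.mean(residuals ** 2))

# R^2: 1 minus (squared error of the model / squared error of predicting the mean).
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f'MAE: {mae:.2f}, RMSE: {rmse:.2f}, R^2: {r2:.2f}')
```

These should agree with `mean_absolute_error`, `mean_squared_error` (after a square root), and `r2_score` from `sklearn.metrics` on the same inputs.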


#### [Feature Engineering](https://en.wikipedia.org/wiki/Feature_engineering)

> "Some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used." — Pedro Domingos, ["A Few Useful Things to Know about Machine Learning"](https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf)

> "Coming up with features is difficult, time-consuming, requires expert knowledge. 'Applied machine learning' is basically feature engineering." — Andrew Ng, [Machine Learning and AI via Brain simulations](https://forum.stanford.edu/events/2011/2011slides/plenary/2011plenaryNg.pdf) 

> Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. 

#### Feature Ideas
- Does the apartment have a description?
- How long is the description?
- How many total perks does each apartment have?
- Are cats _or_ dogs allowed?
- Are cats _and_ dogs allowed?
- Total number of rooms (beds + baths)
- Ratio of beds to baths
- What's the neighborhood, based on address or latitude & longitude?
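A few of these ideas can be sketched on a toy dataframe. The column names follow the renthop dataset, but the rows are invented for illustration, and the new column names (`has_description`, `beds_per_bath`, and so on) are arbitrary choices.

```python
import numpy as np
import pandas as pd

# Toy listings for illustration only; not real renthop rows.
toy = pd.DataFrame({
    'description': ['Sunny 2BR near the park', None, '   '],
    'bedrooms': [2, 1, 0],
    'bathrooms': [1.0, 1.0, 1.0],
    'cats_allowed': [1, 0, 1],
    'dogs_allowed': [0, 0, 1],
})

# Treat missing or whitespace-only text as "no description".
text = toy['description'].fillna('').str.strip()
toy['has_description'] = (text != '').astype(int)
toy['description_length'] = text.str.len()

# Cats OR dogs, and cats AND dogs.
cats = toy['cats_allowed'] == 1
dogs = toy['dogs_allowed'] == 1
toy['cats_or_dogs'] = (cats | dogs).astype(int)
toy['cats_and_dogs'] = (cats & dogs).astype(int)

# Beds-to-baths ratio. A listing with zero bathrooms would divide by zero,
# so zeros become NaN here (one possible choice, not the only one).
toy['beds_per_bath'] = toy['bedrooms'] / toy['bathrooms'].replace(0, np.nan)

print(toy.drop(columns='description'))
```

Note that a linear model cannot use rows with `NaN`, so any missing ratios would need to be filled or dropped before fitting.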

## Stretch Goals
- [ ] If you want more math, skim [_An Introduction to Statistical Learning_](http://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf),  Chapter 3.1, Simple Linear Regression, & Chapter 3.2, Multiple Linear Regression
- [ ] If you want more introduction, watch [Brandon Foltz, Statistics 101: Simple Linear Regression](https://www.youtube.com/watch?v=ZkjP5RJLQF4) (20 minutes, over 1 million views)
- [ ] Add your own stretch goal(s)!

In [None]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Applied-Modeling/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'
    
# Ignore this Numpy warning when using Plotly Express:
# FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning, module='numpy')

In [None]:
import numpy as np
import pandas as pd

# Read New York City apartment rental listing data
df = pd.read_csv(DATA_PATH+'apartments/renthop-nyc.csv', parse_dates=['created'], index_col='created')
#assert df.shape == (49352, 34)

# Remove the most extreme 1% prices,
# the most extreme .1% latitudes, &
# the most extreme .1% longitudes
df = df[(df['price'] >= np.percentile(df['price'], 0.5)) & 
        (df['price'] <= np.percentile(df['price'], 99.5)) & 
        (df['latitude'] >= np.percentile(df['latitude'], 0.05)) & 
        (df['latitude'] < np.percentile(df['latitude'], 99.95)) &
        (df['longitude'] >= np.percentile(df['longitude'], 0.05)) & 
        (df['longitude'] <= np.percentile(df['longitude'], 99.95))]
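To see what the percentile filter above does, here is a minimal sketch on a toy series of 1,000 evenly spaced values (not the real prices). Cutting at the 0.5th and 99.5th percentiles removes roughly the most extreme 1% of rows.

```python
import numpy as np
import pandas as pd

# Toy "prices" 1..1000, for illustration only.
prices = pd.Series(np.arange(1, 1001))

# Same cutoffs as the price filter: 0.5th and 99.5th percentiles.
low, high = np.percentile(prices, [0.5, 99.5])
trimmed = prices[(prices >= low) & (prices <= high)]

removed = len(prices) - len(trimmed)
print(f'cutoffs: {low:.3f} to {high:.3f}, removed {removed} of {len(prices)} rows')
```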

In [None]:
print(df.shape)
df.head(2)

(48817, 33)


created,bathrooms,bedrooms,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space
2016-06-24 07:54:24,1.5,3,A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ...,Metropolitan Avenue,40.7145,-73.9425,3000,792 Metropolitan Avenue,medium,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2016-06-12 12:19:27,1.0,2,,Columbus Avenue,40.7947,-73.9667,5465,808 Columbus Avenue,low,1,1,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [None]:
# Check the categorical variables
df.describe(exclude='number')

,description,display_address,street_address,interest_level
count,47392,48684,48807,48817
unique,37853,8674,15135,3
top,,Broadway,3333 Broadway,low
freq,1627,435,174,33946


In [None]:
# Drop the high-cardinality variables from the dataframe
df.drop(columns=['description', 'display_address', 'street_address'], inplace=True)
df.shape # Number of columns should now go from 33 to 30

(48817, 30)

In [None]:
# Make dummies of the interest_level variable, keeping k-1 dummy columns for its k categories
# Concatenate the dummy dataframe with the main dataframe
dummy = pd.get_dummies(df['interest_level'], prefix='interest', prefix_sep='_', drop_first=True)
df = pd.concat([df, dummy], axis = 1)
print(df.shape) # There should now be 32 total columns in the dataframe
df.head()

(48817, 32)


created,bathrooms,bedrooms,latitude,longitude,price,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space,interest_low,interest_medium
2016-06-24 07:54:24,1.5,3,40.7145,-73.9425,3000,medium,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
2016-06-12 12:19:27,1.0,2,40.7947,-73.9667,5465,low,1,1,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0
2016-04-17 03:26:41,1.0,1,40.7388,-74.0018,2850,high,0,0,1,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2016-04-18 02:22:02,1.0,1,40.7539,-73.9677,3275,low,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0
2016-04-28 01:32:41,1.0,4,40.8241,-73.9493,3350,low,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0
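The effect of `drop_first=True` is easier to see on a tiny example. In this sketch (toy values, not the real listings), `get_dummies` orders the categories alphabetically and drops the first one, `high`, so a row with both dummies equal to 0 means high interest.

```python
import pandas as pd

# Toy interest levels for illustration only.
levels = pd.Series(['medium', 'low', 'high', 'low'], name='interest_level')

# k = 3 categories produce k - 1 = 2 dummy columns.
dummies = pd.get_dummies(levels, prefix='interest', drop_first=True)

# Cast to int so the output reads as 0/1 regardless of pandas version.
print(pd.concat([levels, dummies.astype(int)], axis=1))
```

Dropping one category avoids making the dummy columns perfectly collinear with the model's intercept.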


# Feature Engineering

In [None]:
df.columns

Index(['bathrooms', 'bedrooms', 'latitude', 'longitude', 'price',
       'interest_level', 'elevator', 'cats_allowed', 'hardwood_floors',
       'dogs_allowed', 'doorman', 'dishwasher', 'no_fee',
       'laundry_in_building', 'fitness_center', 'pre-war', 'laundry_in_unit',
       'roof_deck', 'outdoor_space', 'dining_room', 'high_speed_internet',
       'balcony', 'swimming_pool', 'new_construction', 'terrace', 'exclusive',
       'loft', 'garden_patio', 'wheelchair_access', 'common_outdoor_space',
       'interest_low', 'interest_medium'],
      dtype='object')

In [None]:
# Create new feature num_perks which counts all the perks in an apartment
perks = ['elevator', 'cats_allowed', 'hardwood_floors', 'dogs_allowed',
         'doorman', 'dishwasher', 'no_fee', 'laundry_in_building',
         'fitness_center', 'pre-war', 'laundry_in_unit', 'roof_deck',
         'outdoor_space', 'dining_room', 'high_speed_internet', 'balcony',
         'swimming_pool', 'new_construction', 'exclusive', 'terrace', 
         'loft', 'garden_patio', 'common_outdoor_space', 
         'wheelchair_access']

# total rooms feature
df['total_rooms'] = df['bathrooms'] + df['bedrooms']

# feature for total number of perks
df['num_perks'] = df[perks].sum(axis=1)

# Create new feature for pets called pets_allowed

df['pets_allowed'] = np.where((df['cats_allowed']==1)|(df['dogs_allowed']==1), 1,0)

# Feature for all pets allowed
df['all_pets_allowed'] = np.where((df['cats_allowed']==1)&(df['dogs_allowed']==1), 1,0)

# Drop the interest_level column
df.drop(columns=['interest_level'], inplace=True)

print(df.shape) # Number of columns should now be 35 (32 + 4 new features - interest_level)
df.columns

(48817, 35)


Index(['bathrooms', 'bedrooms', 'latitude', 'longitude', 'price', 'elevator',
       'cats_allowed', 'hardwood_floors', 'dogs_allowed', 'doorman',
       'dishwasher', 'no_fee', 'laundry_in_building', 'fitness_center',
       'pre-war', 'laundry_in_unit', 'roof_deck', 'outdoor_space',
       'dining_room', 'high_speed_internet', 'balcony', 'swimming_pool',
       'new_construction', 'terrace', 'exclusive', 'loft', 'garden_patio',
       'wheelchair_access', 'common_outdoor_space', 'interest_low',
       'interest_medium', 'total_rooms', 'num_perks', 'pets_allowed',
       'all_pets_allowed'],
      dtype='object')

# Split the Data
1. The training data will include all data from April and May
2. The test data will include all June data

In [None]:
# Split on dataframe date index
train = df[df.index < '2016-06-01']
test = df[df.index >= '2016-06-01']
train.shape, test.shape          

((31844, 35), (16973, 35))
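A quick sanity check for a time-based split is that every training row comes before every test row and that no rows are lost. The sketch below assumes a toy dataframe with a `DatetimeIndex`; the rows and the names `toy_train` and `toy_test` are made up for illustration.

```python
import pandas as pd

# Toy rows for illustration only, indexed by creation time.
idx = pd.to_datetime(['2016-04-17 03:26:41', '2016-05-31 23:59:59',
                      '2016-06-01 00:00:00', '2016-06-24 07:54:24'])
toy = pd.DataFrame({'price': [2850, 3100, 3000, 5465]}, index=idx)

# Everything before June 1 trains; June 1 onward tests.
cutoff = pd.Timestamp('2016-06-01')
toy_train = toy[toy.index < cutoff]
toy_test = toy[toy.index >= cutoff]

# No overlap in time, and no rows dropped by the split.
assert toy_train.index.max() < toy_test.index.min()
assert len(toy_train) + len(toy_test) == len(toy)
print(toy_train.index.month.unique().tolist(), toy_test.index.month.unique().tolist())
```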

In [None]:
# Define target and features
target = 'price'
features = df.columns.drop([target])

In [None]:
# Create X and y training and testing dataframes
X_train = train[features]
y_train = train[target]

X_test = test[features]
y_test = test[target]

print('X shapes (train, test): {} {}'.format(X_train.shape, X_test.shape))
print('y shapes (train, test): {} {}'.format(y_train.shape, y_test.shape))

X shapes (train, test): (31844, 34) (16973, 34)
y shapes (train, test): (31844,) (16973,)


# Baseline

In [None]:
# Mean rental price in NYC - Mean Baseline
first_guess = train[target].mean()
print(f'The average rental price in NYC from the data is ${first_guess:0.2f}')

The average rental price in NYC from the data is $3575.60


In [None]:
# Baseline errors
from sklearn.metrics import mean_absolute_error as mae, mean_squared_error as mse, r2_score

y_pred = [first_guess] * len(y_train)

def find_errors(test_on, data, pred):
  """
  Print the error metrics and R squared value for a set of predictions.

  test_on: A string label for the printout: 'Baseline', 'Train' or 'Test'
  data: The true target values (for example y_train or y_test)
  pred: The predicted values, same length as data
  """
  MAE = mae(data, pred)
  MSE = mse(data, pred)
  RMSE = np.sqrt(MSE)
  r2 = r2_score(data, pred)
  print(test_on + " Errors: ")
  print("----------------------")
  print(f'Mean Square Error: ${MSE:0.2f}')
  print(f'Root Mean Square Error: ${RMSE:0.2f}')
  print(f'Mean Absolute Error: ${MAE:0.2f}')
  print(f'R Sqaured: {r2:0.2f}')

find_errors('Baseline',y_train, y_pred)

Baseline Errors: 
----------------------
Mean Square Error: $3105028.22
Root Mean Square Error: $1762.11
Mean Absolute Error: $1201.88
R Sqaured: 0.00
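An R squared of 0.00 is expected here: predicting the training mean on the training data makes the residual sum of squares equal to the total sum of squares. The sketch below shows this with toy numbers, and also shows that the same constant guess can score below zero on different data. All values are invented for illustration.

```python
import numpy as np

def r_squared(y, pred):
    """R^2 = 1 - SS_res / SS_tot, computed from the definition."""
    ss_res = np.sum((y - pred) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1 - ss_res / ss_tot

# Toy "training" prices; the baseline guess is their mean (3500).
y_seen = np.array([2000.0, 3000.0, 4000.0, 5000.0])
guess = y_seen.mean()
r2_seen = r_squared(y_seen, np.full_like(y_seen, guess))

# Toy "new" prices with a different mean (4000): the old guess is now
# worse than this data's own mean, so R^2 drops below zero.
y_new = np.array([2500.0, 3500.0, 4500.0, 5500.0])
r2_new = r_squared(y_new, np.full_like(y_new, guess))

print(f'R^2 on seen data: {r2_seen:.2f}, on new data: {r2_new:.2f}')
```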


# Build and Fit the Model

In [None]:
# Import Regression model
from sklearn.linear_model import LinearRegression

model = LinearRegression()

# Rearrange X features - Start by using all the features
X_train = train[features]
X_test = test[features]

model.fit(X_train, y_train)
y_pred = model.predict(X_train)

find_errors('Train', y_train, y_pred)

Train Errors: 
----------------------
Mean Square Error: $1125382.17
Root Mean Square Error: $1060.84
Mean Absolute Error: $673.10
R Sqaured: 0.64


In [None]:
# Predict using test data

y_pred = model.predict(X_test)
find_errors('Test', y_test, y_pred)

Test Errors: 
----------------------
Mean Square Error: $1094203.50
Root Mean Square Error: $1046.04
Mean Absolute Error: $676.19
R Sqaured: 0.65


In [None]:
features

Index(['bathrooms', 'bedrooms', 'latitude', 'longitude', 'elevator',
       'cats_allowed', 'hardwood_floors', 'dogs_allowed', 'doorman',
       'dishwasher', 'no_fee', 'laundry_in_building', 'fitness_center',
       'pre-war', 'laundry_in_unit', 'roof_deck', 'outdoor_space',
       'dining_room', 'high_speed_internet', 'balcony', 'swimming_pool',
       'new_construction', 'terrace', 'exclusive', 'loft', 'garden_patio',
       'wheelchair_access', 'common_outdoor_space', 'interest_low',
       'interest_medium', 'total_rooms', 'num_perks', 'pets_allowed',
       'all_pets_allowed'],
      dtype='object')

In [None]:
# Trial with selected features
more_features = ['bathrooms','bedrooms','latitude','longitude','interest_low',
                 'num_perks','all_pets_allowed']
X_test = test[more_features]
X_train = train[more_features]
y_test = test[target]
y_train = train[target]
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
find_errors('Test', y_test, y_pred)

Test Errors: 
----------------------
Mean Square Error: $1168831.36
Root Mean Square Error: $1081.13
Mean Absolute Error: $701.96
R Sqaured: 0.62


In [None]:
model.intercept_, model.coef_

(-1099936.7749047894,
 array([  1832.4111839 ,    454.96854283,   1380.10871004, -14108.90450905,
           644.19044755,     57.77075324,    -52.74503685]))
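A bare array of coefficients is hard to read, so it can help to pair each value with its feature name. The sketch below fits a toy model (`toy_X`, `toy_y`, and `toy_model` are hypothetical names, and the data is constructed so that price = 500 + 1000 × bedrooms + 200 × bathrooms exactly) and labels the coefficients with a pandas Series.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy features for illustration only.
toy_X = pd.DataFrame({'bedrooms': [0, 1, 2, 3, 1],
                      'bathrooms': [1, 1, 1, 2, 2]})
# Target built from a known formula, so the fit should recover it.
toy_y = 500 + 1000 * toy_X['bedrooms'] + 200 * toy_X['bathrooms']

toy_model = LinearRegression().fit(toy_X, toy_y)

# coef_ follows the column order of the training data.
coefs = pd.Series(toy_model.coef_, index=toy_X.columns)
print(f'Intercept: {toy_model.intercept_:.2f}')
print(coefs.round(2))
```

Applied to the cell above, the same pattern would be `pd.Series(model.coef_, index=more_features)`, since the model was last fit on `more_features`.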

In [None]:

def big_bang(features):
  """
  Takes a list of features and adds them one at a time to a growing
  feature list. After each addition, refits the model on the training
  data and prints the errors on the test data.
  Relies on the global train, test, target and model objects.
  """
  new_features = []
  y_train = train[target]
  y_test = test[target]
  # n counts how many features are in use, starting at 1
  for n, feature in enumerate(features, start=1):
    new_features.append(feature)
    X_train = train[new_features]
    X_test = test[new_features]
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(f'Test with {n} features')
    find_errors('Test', y_test, y_pred)
    print(" ")

In [None]:
features # features to be iterated through

Index(['bathrooms', 'bedrooms', 'latitude', 'longitude', 'elevator',
       'cats_allowed', 'hardwood_floors', 'dogs_allowed', 'doorman',
       'dishwasher', 'no_fee', 'laundry_in_building', 'fitness_center',
       'pre-war', 'laundry_in_unit', 'roof_deck', 'outdoor_space',
       'dining_room', 'high_speed_internet', 'balcony', 'swimming_pool',
       'new_construction', 'terrace', 'exclusive', 'loft', 'garden_patio',
       'wheelchair_access', 'common_outdoor_space', 'interest_low',
       'interest_medium', 'total_rooms', 'num_perks', 'pets_allowed',
       'all_pets_allowed'],
      dtype='object')

In [None]:
big_bang(features) # function call to iterate through features

Test with 1 features
Test Errors: 
----------------------
Mean Square Error: $1615207.84
Root Mean Square Error: $1270.91
Mean Absolute Error: $890.46
R Sqaured: 0.48
 
Test with 2 features
Test Errors: 
----------------------
Mean Square Error: $1487715.31
Root Mean Square Error: $1219.72
Mean Absolute Error: $825.90
R Sqaured: 0.52
 
Test with 3 features
Test Errors: 
----------------------
Mean Square Error: $1483100.39
Root Mean Square Error: $1217.83
Mean Absolute Error: $820.07
R Sqaured: 0.52
 
Test with 4 features
Test Errors: 
----------------------
Mean Square Error: $1279602.32
Root Mean Square Error: $1131.20
Mean Absolute Error: $744.98
R Sqaured: 0.59
 
Test with 5 features
Test Errors: 
----------------------
Mean Square Error: $1254481.13
Root Mean Square Error: $1120.04
Mean Absolute Error: $730.03
R Sqaured: 0.60
 
Test with 6 features
Test Errors: 
----------------------
Mean Square Error: $1251630.73
Root Mean Square Error: $1118.76
Mean Absolute Error: $728.56
R Sq

## Best test MAE is $676.07 with an R-squared of 0.65
## Used 31 features from the dataframe.
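Reading the best score out of a long printout is error-prone. One alternative is to collect the score from each step in a dataframe and look up the minimum. The sketch below is a hypothetical variant of the incremental loop, run on synthetic data; `incremental_mae` and the toy columns `a`, `b`, and `noise` are made-up names, not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# Synthetic data for illustration: y depends on a and b, not on noise.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({'a': rng.normal(size=n),
                    'b': rng.normal(size=n),
                    'noise': rng.normal(size=n)})
toy['y'] = 3 * toy['a'] + 2 * toy['b'] + rng.normal(scale=0.1, size=n)
toy_train, toy_test = toy.iloc[:150], toy.iloc[150:]

def incremental_mae(train_df, test_df, features, target):
    """Refit after adding each feature in order; return test MAE per step."""
    rows, used = [], []
    for feature in features:
        used.append(feature)
        fitted = LinearRegression().fit(train_df[used], train_df[target])
        pred = fitted.predict(test_df[used])
        rows.append({'n_features': len(used),
                     'added': feature,
                     'test_mae': mean_absolute_error(test_df[target], pred)})
    return pd.DataFrame(rows)

scores = incremental_mae(toy_train, toy_test, ['a', 'b', 'noise'], 'y')
best = scores.loc[scores['test_mae'].idxmin()]
print(scores)
print(f"Lowest test MAE at {best['n_features']} features")
```

Because features are added in a fixed order, this only compares prefixes of the list; a different ordering could give a different best subset.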

In [None]:
# Features producing lowest MAE - Training values
final_features = ['bathrooms', 'bedrooms', 'latitude', 'longitude', 'elevator',
       'cats_allowed', 'hardwood_floors', 'dogs_allowed', 'doorman',
       'dishwasher', 'no_fee', 'laundry_in_building', 'fitness_center',
       'pre-war', 'laundry_in_unit', 'roof_deck', 'outdoor_space',
       'dining_room', 'high_speed_internet', 'balcony', 'swimming_pool',
       'new_construction', 'terrace', 'exclusive', 'loft', 'garden_patio',
       'wheelchair_access', 'common_outdoor_space', 'interest_low',
       'interest_medium', 'total_rooms']

X_train = train[final_features]
X_test = test[final_features]

model.fit(X_train, y_train)
y_pred = model.predict(X_train)

find_errors('Train', y_train, y_pred)

Train Errors: 
----------------------
Mean Square Error: $1125427.16
Root Mean Square Error: $1060.86
Mean Absolute Error: $673.08
R Sqaured: 0.64


In [None]:
# Testing values
y_pred = model.predict(X_test)
find_errors('Test', y_test, y_pred)

Test Errors: 
----------------------
Mean Square Error: $1094022.66
Root Mean Square Error: $1045.96
Mean Absolute Error: $676.07
R Sqaured: 0.65
