<a href="https://colab.research.google.com/github/RAV10K1/DS-Unit-2-Linear-Models/blob/master/LS_DS_212_Regression_2_assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science

*Unit 2, Sprint 1, Module 2*

---

# Regression 2

## Assignment

You'll continue to **predict how much it costs to rent an apartment in NYC,** using the dataset from renthop.com.

- [ ] Do train/test split. Use data from April & May 2016 to train. Use data from June 2016 to test.
- [ ] Engineer at least two new features. (See below for explanation & ideas.)
- [ ] Fit a linear regression model with at least two features.
- [ ] Get the model's coefficients and intercept.
- [ ] Get regression metrics RMSE, MAE, and $R^2$, for both the train and test data.
- [ ] What's the best test MAE you can get? Share your score and features used with your cohort on Slack!
- [ ] As always, commit your notebook to your fork of the GitHub repo.


#### [Feature Engineering](https://en.wikipedia.org/wiki/Feature_engineering)

> "Some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used." — Pedro Domingos, ["A Few Useful Things to Know about Machine Learning"](https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf)

> "Coming up with features is difficult, time-consuming, requires expert knowledge. 'Applied machine learning' is basically feature engineering." — Andrew Ng, [Machine Learning and AI via Brain simulations](https://forum.stanford.edu/events/2011/2011slides/plenary/2011plenaryNg.pdf) 

> Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. 

#### Feature Ideas
- Does the apartment have a description?
- How long is the description?
- How many total perks does each apartment have?
- Are cats _or_ dogs allowed?
- Are cats _and_ dogs allowed?
- Total number of rooms (beds + baths)
- Ratio of beds to baths
- What's the neighborhood, based on address or latitude & longitude?

## Stretch Goals
- [ ] If you want more math, skim [_An Introduction to Statistical Learning_](http://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf),  Chapter 3.1, Simple Linear Regression, & Chapter 3.2, Multiple Linear Regression
- [ ] If you want more introduction, watch [Brandon Foltz, Statistics 101: Simple Linear Regression](https://www.youtube.com/watch?v=ZkjP5RJLQF4)
(20 minutes, over 1 million views)
- [ ] Add your own stretch goal(s) !

In [0]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Applied-Modeling/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'
    
# Ignore this Numpy warning when using Plotly Express:
# FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning, module='numpy')

In [0]:
import numpy as np
import pandas as pd

# Read New York City apartment rental listing data
df = pd.read_csv(DATA_PATH+'apartments/renthop-nyc.csv')
assert df.shape == (49352, 34)

# Remove the most extreme 1% prices,
# the most extreme .1% latitudes, &
# the most extreme .1% longitudes
df = df[(df['price'] >= np.percentile(df['price'], 0.5)) & 
        (df['price'] <= np.percentile(df['price'], 99.5)) & 
        (df['latitude'] >= np.percentile(df['latitude'], 0.05)) & 
        (df['latitude'] < np.percentile(df['latitude'], 99.95)) &
        (df['longitude'] >= np.percentile(df['longitude'], 0.05)) & 
        (df['longitude'] <= np.percentile(df['longitude'], 99.95))]

In [4]:
# Checking dataframe
df.head()

Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space
0,1.5,3,2016-06-24 07:54:24,A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ...,Metropolitan Avenue,40.7145,-73.9425,3000,792 Metropolitan Avenue,medium,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,1.0,2,2016-06-12 12:19:27,,Columbus Avenue,40.7947,-73.9667,5465,808 Columbus Avenue,low,1,1,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,1.0,1,2016-04-17 03:26:41,"Top Top West Village location, beautiful Pre-w...",W 13 Street,40.7388,-74.0018,2850,241 W 13 Street,high,0,0,1,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,1.0,1,2016-04-18 02:22:02,Building Amenities - Garage - Garden - fitness...,East 49th Street,40.7539,-73.9677,3275,333 East 49th Street,low,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,1.0,4,2016-04-28 01:32:41,Beautifully renovated 3 bedroom flex 4 bedroom...,West 143rd Street,40.8241,-73.9493,3350,500 West 143rd Street,low,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [0]:
# Importing libraries
import plotly.express as px
from sklearn.linear_model import LinearRegression

In [6]:
# Sorting dataframe values by date
df1 = df.sort_values(by='created', ascending=True)
df1.head()

Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space
5186,1.0,1,2016-04-01 22:12:41,X-LARGE Flex 1BR Loft! ~~ PRIME Greenwich Vill...,Astor Place,40.7302,-73.9924,3195,1 Astor Place,high,1,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
7945,1.0,0,2016-04-01 22:56:00,"This Enormous Studio Features: Harwood Floors,...",East 54th Street,40.7576,-73.9677,2000,230 East 54th Street,medium,0,1,0,1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
6424,2.0,3,2016-04-01 22:57:15,--- East 31st St & Lexington Avenue --- This S...,East 31st St & Lexington Avenue,40.7388,-73.9851,5850,105 Lexington Avenue,high,1,1,0,1,1,0,1,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1
7719,1.0,1,2016-04-01 23:26:07,Reduced Fee!! Priced To Rent!\rLarge Newly Upd...,West End Ave,40.7939,-73.9738,2745,700 West End Ave,medium,1,1,0,1,1,0,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1
1723,1.0,1,2016-04-02 00:48:13,Phenomenal deal of the century!! This spacious...,E 88th street,40.7784,-73.9491,2400,401 E 88th street,medium,1,1,0,1,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [7]:
# Creating new column for total number of amenities
df1['total_amenities'] = df1['elevator'] + df1['cats_allowed'] + df1['hardwood_floors'] + df1['dogs_allowed']
+ df1['doorman'] + df1['dishwasher'] + df1['no_fee'] + df1['laundry_in_building']
+ df1['fitness_center'] + df1['pre-war'] + df1['laundry_in_unit'] + df1['roof_deck']
+ df1['dining_room'] + df1['high_speed_internet'] + df1['balcony'] + df1['swimming_pool']
+ df1['new_construction'] + df1['terrace'] + df1['exclusive'] + df1['loft']
+ df1['garden_patio'] + df1['wheelchair_access'] + df1['common_outdoor_space']

5186     0
7945     0
6424     1
7719     1
1723     0
        ..
19943    0
16801    0
32633    0
20560    0
17743    0
Length: 48817, dtype: int64

In [0]:
# Creating new column for total number of rooms
df1['total_rooms'] = df1['bedrooms'] + df1['bathrooms']

In [9]:
# Creating train dataset
train = df1[df1['created'] < '2016-06-01 00:00:00']
print(train.shape)
train.head() 

(31844, 36)


Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space,total_amenities,total_rooms
5186,1.0,1,2016-04-01 22:12:41,X-LARGE Flex 1BR Loft! ~~ PRIME Greenwich Vill...,Astor Place,40.7302,-73.9924,3195,1 Astor Place,high,1,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,2.0
7945,1.0,0,2016-04-01 22:56:00,"This Enormous Studio Features: Harwood Floors,...",East 54th Street,40.7576,-73.9677,2000,230 East 54th Street,medium,0,1,0,1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,1.0
6424,2.0,3,2016-04-01 22:57:15,--- East 31st St & Lexington Avenue --- This S...,East 31st St & Lexington Avenue,40.7388,-73.9851,5850,105 Lexington Avenue,high,1,1,0,1,1,0,1,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,3,5.0
7719,1.0,1,2016-04-01 23:26:07,Reduced Fee!! Priced To Rent!\rLarge Newly Upd...,West End Ave,40.7939,-73.9738,2745,700 West End Ave,medium,1,1,0,1,1,0,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,3,2.0
1723,1.0,1,2016-04-02 00:48:13,Phenomenal deal of the century!! This spacious...,E 88th street,40.7784,-73.9491,2400,401 E 88th street,medium,1,1,0,1,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,2.0


In [10]:
# Creating test dataset
test = df1[df1['created'] > '2016-06-01 00:00:00']
print(test.shape)
test.head()

(16973, 36)


Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space,total_amenities,total_rooms
11474,1.0,1,2016-06-01 01:10:37,beautiful one bedroom in Williamsburg fit for 1,Montrose Avenue,40.7071,-73.9475,2150,61 Montrose Avenue,low,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2.0
19176,1.0,1,2016-06-01 01:11:06,"Welcome home to charming, tree-lined Quincy St...",Quincy Street,40.6877,-73.945,2795,387 Quincy Street,low,0,1,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,2.0
16226,1.0,2,2016-06-01 01:11:12,A beautiful 2 bedroom Apartment in the East Vi...,Saint Marks Place,40.7291,-73.9895,3495,8 Saint Marks Place,low,0,1,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,3.0
37756,1.0,0,2016-06-01 01:11:52,"Nice studio, hardwood floors, exposed brick, d...",East 2nd Street,40.7213,-73.9822,2100,237 East 2nd Street,low,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1.0
17946,1.0,2,2016-06-01 01:12:22,Great bright and spacious 2 bedrooms two bathr...,East 118th Street,40.7986,-73.9372,2900,249 East 118th Street,low,0,1,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,3.0


In [11]:
# Visualizing Bedrooms and Price columns
px.scatter(df1, x='bedrooms', y='price', trendline='ols')

As seen in the previous assignment the number of bedrooms has a linear relationship to the rent price of an apartment.

In [12]:
# Visualizing Bathrooms and Price columns
px.scatter(df1, x='bathrooms', y='price', trendline='ols')

The number of bathrooms in an apartment also shows a linear relationship with the rent price.

In [13]:
# Visualizing total rooms and price columns
px.scatter(df1, x='total_rooms', y='price', trendline='ols')

In [15]:
# Visualizing total amenities and price columns
px.scatter(df1, x='total_amenities', y='price', trendline='ols')

Upon exploratory data analysis it appears that the best dependent variables to use for predicting apartment rent prices would be the total number of rooms and total number of amenities as they show strong linear relationships.

In [17]:
# Instantiating Linear Regression model
model = LinearRegression()
model

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [0]:
# Defining features and target
features = ['total_rooms', 'total_amenities']
target = 'price'
X_train = train[features]
y_train = train[target]

In [0]:
# Defining y_test
y_test = test[target]

In [20]:
# Fitting data to instantiated model
model.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [22]:
# Applying model to new data
rooms, amenities = 3, 2
X_test = [rooms, amenities]
y_pred = model.predict([X_test])
y_pred

array([3803.24057449])

In [24]:
# Calculating coefficient and intercept
m0 = model.coef_[0]
b0 = model.intercept_
print(f'Coefficient of the model is ${m0:,.2f}')
print(f'Intercept of the model is ${b0:,.2f}')

Coefficient of the model is $801.30
Intercept of the model is $991.58


In this model each additional room/bathroom and amenity adds $ 801.30 to the rent price.

In [0]:
# Defining function for prediction of rent prices based on total number of rooms and amenities
def predict(total_rooms, total_amenities):
  y_pred = model.predict([[total_rooms, total_amenities]])
  estimate = y_pred[0]
  coefficient = model.coef_[0]
  result = f'${estimate:,.2f} is the estimated rent price for a {total_rooms:,.0f} room apartment in NYC with {total_amenities:,.0f} amenities.'
  explanation = f' In this linear regression model, each additional bedroom and amenity adds ${coefficient:,.2f} to the rent price.'
  return result + explanation

In [36]:
# Predicting the price for a 3 bedroom apartment with 4 amenities
print(predict(3,3))

$4,007.13 is the estimated rent price for a 3 room apartment in NYC with 3 amenities. In this linear regression model, each additional bedroom and amenity adds $801.30 to the rent price.


In [46]:
# Calculating Mean Absolute Error for train data
from sklearn.metrics import mean_absolute_error
y_pred = [y_train.mean()] * len(y_train)
mae = mean_absolute_error(y_train, y_pred)
print(f'Train Mean Absolute Error (apartment rent prices): ${mae:.2f} per month')

Train Mean Absolute Error (apartment rent prices): $1201.88 per month


In [47]:
# Calculating Mean Absolute Error for test data
y_pred = [y_test.mean()] * len(y_test)
mae = mean_absolute_error(y_test, y_pred)
print(f'Test Mean Absolute Error (apartment rent price): ${mae:.2f} per month')

Test Mean Absolute Error (apartment rent price): $1200.87 per month


In [49]:
# Calculating Mean Squared Error and Root Mean Squared Error for train data
from sklearn.metrics import mean_squared_error
y_pred = [y_train.mean()] * len(y_train)
rmse = mean_squared_error(y_train, y_pred)
print(f'Train Mean Squared Error (apartment rent price): ${rmse:.2f} per month')
print(f'Train Root Mean Squared Error (apartment rent price); ${np.sqrt(rmse):.2f} per month')

Train Mean Squared Error (apartment rent price): $3105028.22 per month
Train Root Mean Squared Error (apartment rent price); $1762.11 per month


In [50]:
# Calculating Mean Squared Error and Root Mean Squared Error for test data
from sklearn.metrics import mean_squared_error
y_pred = [y_test.mean()] * len(y_test)
rmse = mean_squared_error(y_test, y_pred)
print(f'Test Mean Squared Error (apartment rent price): ${rmse:.2f} per month')
print(f'Test Root Mean Squared Error (apartment rent price); ${np.sqrt(rmse):.2f} per month')

Test Mean Squared Error (apartment rent price): $3108021.27 per month
Test Root Mean Squared Error (apartment rent price); $1762.96 per month


In [53]:
# Calculating R2 score for train data
from sklearn.metrics import r2_score
y_pred = [y_train.mean()] * len(y_train)
r2score = r2_score(y_train, y_pred)
print(f'Train R2 score (apartment rent price): ${r2score:.2f}')

Train R2 score (apartment rent price): $0.00


In [54]:
# Calculating R2 score for test data
from sklearn.metrics import r2_score
y_pred = [y_test.mean()] * len(y_test)
r2score = r2_score(y_test, y_pred)
print(f'Test R2 score (apartment rent price): ${r2score:.2f}')

Test R2 score (apartment rent price): $0.00
