<a href="https://colab.research.google.com/github/Bhavani-Rajan/DS-Unit-2-Regression-Classification/blob/master/module2/assignment_regression_classification_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science, Unit 2: Predictive Modeling

# Regression & Classification, Module 2

## Assignment

You'll continue to **predict how much it costs to rent an apartment in NYC,** using the dataset from renthop.com.

- [ ] Do train/test split. Use data from April & May 2016 to train. Use data from June 2016 to test.
- [ ] Engineer at least two new features. (See below for explanation & ideas.)
- [ ] Fit a linear regression model with at least two features.
- [ ] Get the model's coefficients and intercept.
- [ ] Get regression metrics RMSE, MAE, and $R^2$, for both the train and test data.
- [ ] What's the best test MAE you can get? Share your score and features used with your cohort on Slack!
- [ ] As always, commit your notebook to your fork of the GitHub repo.


#### [Feature Engineering](https://en.wikipedia.org/wiki/Feature_engineering)

> "Some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used." — Pedro Domingos, ["A Few Useful Things to Know about Machine Learning"](https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf)

> "Coming up with features is difficult, time-consuming, requires expert knowledge. 'Applied machine learning' is basically feature engineering." — Andrew Ng, [Machine Learning and AI via Brain simulations](https://forum.stanford.edu/events/2011/2011slides/plenary/2011plenaryNg.pdf) 

> Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. 

#### Feature Ideas
- Does the apartment have a description?
- How long is the description?
- How many total perks does each apartment have?
- Are cats _or_ dogs allowed?
- Are cats _and_ dogs allowed?
- Total number of rooms (beds + baths)
- Ratio of beds to baths
- What's the neighborhood, based on address or latitude & longitude?

## Stretch Goals
- [ ] If you want more math, skim [_An Introduction to Statistical Learning_](http://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf), Chapter 3.1, Simple Linear Regression, & Chapter 3.2, Multiple Linear Regression
- [ ] If you want more introduction, watch [Brandon Foltz, Statistics 101: Simple Linear Regression](https://www.youtube.com/watch?v=ZkjP5RJLQF4) (20 minutes, over 1 million views)
- [ ] Add your own stretch goal(s)!

In [1]:
import os, sys
in_colab = 'google.colab' in sys.modules

# If you're in Colab...
if in_colab:
    # Pull files from Github repo
    os.chdir('/content')
    !git init .
    !git remote add origin https://github.com/LambdaSchool/DS-Unit-2-Regression-Classification.git
    !git pull origin master
    
    # Install required python packages
    !pip install -r requirements.txt
    
    # Change into directory for module
    os.chdir('module2')

Initialized empty Git repository in /content/.git/
remote: Enumerating objects: 156, done.
remote: Total 156 (delta 0), reused 0 (delta 0), pack-reused 156
Receiving objects: 100% (156/156), 19.30 MiB | 19.28 MiB/s, done.
Resolving deltas: 100% (71/71), done.
From https://github.com/LambdaSchool/DS-Unit-2-Regression-Classification
 * branch            master     -> FETCH_HEAD
 * [new branch]      master     -> origin/master
Collecting category_encoders==2.0.0 (from -r requirements.txt (line 1))
  Downloading https://files.pythonhosted.org/packages/6e/a1/f7a22f144f33be78afeb06bfa78478e8284a64263a3c09b1ef54e673841e/category_encoders-2.0.0-py2.py3-none-any.whl (87kB)
Collecting eli5==0.10.0 (from -r requirements.txt (line 2))
  Downloading https://files.pythonhosted.org/packages/e6/ea/47bd5844bb609d45821114aa7e0bc9e4422053fe24a6cf6b357f0d3f74d3/eli5-0.10.0-py2.py3-none-any.whl (105kB)

In [0]:
# Ignore this Numpy warning when using Plotly Express:
# FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning, module='numpy')

In [0]:
import numpy as np
import pandas as pd

# Read New York City apartment rental listing data
df = pd.read_csv('../data/apartments/renthop-nyc.csv')
assert df.shape == (49352, 34)

# Remove the most extreme 1% prices,
# the most extreme .1% latitudes, &
# the most extreme .1% longitudes
df = df[(df['price'] >= np.percentile(df['price'], 0.5)) & 
        (df['price'] <= np.percentile(df['price'], 99.5)) & 
        (df['latitude'] >= np.percentile(df['latitude'], 0.05)) & 
        (df['latitude'] < np.percentile(df['latitude'], 99.95)) &
        (df['longitude'] >= np.percentile(df['longitude'], 0.05)) & 
        (df['longitude'] <= np.percentile(df['longitude'], 99.95))]
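
The three percentile filters above repeat one pattern, so a small helper may make the intent easier to check. This is a minimal sketch on made-up data: `trim_extremes` and the toy frame are hypothetical names, not part of the assignment repo. Note that `Series.between` keeps both endpoints, while the cell above uses a strict `<` for the upper latitude bound, so on the real data the two approaches could differ by a few rows.

```python
import numpy as np
import pandas as pd


def trim_extremes(frame, column, lower_pct, upper_pct):
    """Keep rows whose `column` value lies between two percentiles (inclusive)."""
    low, high = np.percentile(frame[column], [lower_pct, upper_pct])
    return frame[frame[column].between(low, high)]


# Toy data (hypothetical): prices 1..1000
toy_prices = pd.DataFrame({'price': list(range(1, 1001))})

# Drop the most extreme 1% of prices (0.5% from each tail)
trimmed = trim_extremes(toy_prices, 'price', 0.5, 99.5)
print(len(toy_prices), len(trimmed))
```

Chaining the helper once per column (price, latitude, longitude) would apply the filters one after another, which is slightly different from computing all percentiles on the unfiltered frame as the cell above does.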

In [4]:
df.head()

Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space
0,1.5,3,2016-06-24 07:54:24,A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ...,Metropolitan Avenue,40.7145,-73.9425,3000,792 Metropolitan Avenue,medium,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,1.0,2,2016-06-12 12:19:27,,Columbus Avenue,40.7947,-73.9667,5465,808 Columbus Avenue,low,1,1,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,1.0,1,2016-04-17 03:26:41,"Top Top West Village location, beautiful Pre-w...",W 13 Street,40.7388,-74.0018,2850,241 W 13 Street,high,0,0,1,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,1.0,1,2016-04-18 02:22:02,Building Amenities - Garage - Garden - fitness...,East 49th Street,40.7539,-73.9677,3275,333 East 49th Street,low,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,1.0,4,2016-04-28 01:32:41,Beautifully renovated 3 bedroom flex 4 bedroom...,West 143rd Street,40.8241,-73.9493,3350,500 West 143rd Street,low,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [5]:
df.dtypes

bathrooms               float64
bedrooms                  int64
created                  object
description              object
display_address          object
latitude                float64
longitude               float64
price                     int64
street_address           object
interest_level           object
elevator                  int64
cats_allowed              int64
hardwood_floors           int64
dogs_allowed              int64
doorman                   int64
dishwasher                int64
no_fee                    int64
laundry_in_building       int64
fitness_center            int64
pre-war                   int64
laundry_in_unit           int64
roof_deck                 int64
outdoor_space             int64
dining_room               int64
high_speed_internet       int64
balcony                   int64
swimming_pool             int64
new_construction          int64
terrace                   int64
exclusive                 int64
loft                      int64
garden_p

In [0]:
df['created'].value_counts()

In [0]:
df['date'] = pd.to_datetime(df['created'])

In [25]:
df['date'].dt.year.value_counts()

2016    48817
Name: date, dtype: int64

#### Do train/test split. Use data from April & May 2016 to train. Use data from June 2016 to test.

In [0]:
train = df[df['date'].dt.month.isin([4, 5])]
test = df[df['date'].dt.month == 6]

In [32]:
len(train),len(test),len(df)

(31844, 16973, 48817)
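
The lengths above add up (31844 + 16973 = 48817), so every remaining row falls into exactly one of the two sets. A time-based split is also worth checking for ordering: no training row should be dated after the earliest test row. The sketch below runs that check on a tiny made-up frame; `listings`, `train_demo`, and `test_demo` are hypothetical names, and the May row is invented for illustration.

```python
import pandas as pd

# Toy listings (hypothetical); three timestamps mirror df.head(), the May row is made up
listings = pd.DataFrame({
    'created': ['2016-04-17 03:26:41', '2016-05-02 10:00:00',
                '2016-06-24 07:54:24', '2016-06-12 12:19:27'],
    'price': [2850, 3100, 3000, 5465],
})
listings['date'] = pd.to_datetime(listings['created'])

# April and May go to train, June goes to test
month = listings['date'].dt.month
train_demo = listings[month.isin([4, 5])]
test_demo = listings[month == 6]

# The two sets should partition the data and respect time order
assert len(train_demo) + len(test_demo) == len(listings)
assert train_demo['date'].max() < test_demo['date'].min()
print(len(train_demo), len(test_demo), len(listings))
```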

#### Engineer at least two new features. 



In [33]:
df.dtypes

bathrooms                      float64
bedrooms                         int64
created                         object
description                     object
display_address                 object
latitude                       float64
longitude                      float64
price                            int64
street_address                  object
interest_level                  object
elevator                         int64
cats_allowed                     int64
hardwood_floors                  int64
dogs_allowed                     int64
doorman                          int64
dishwasher                       int64
no_fee                           int64
laundry_in_building              int64
fitness_center                   int64
pre-war                          int64
laundry_in_unit                  int64
roof_deck                        int64
outdoor_space                    int64
dining_room                      int64
high_speed_internet              int64
balcony                  

In [34]:
df['description'].head(10)

0    A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ...
1                                                     
2    Top Top West Village location, beautiful Pre-w...
3    Building Amenities - Garage - Garden - fitness...
4    Beautifully renovated 3 bedroom flex 4 bedroom...
5                                                     
6    Stunning unit with a great location and lots o...
7    This huge sunny ,plenty of lights 1 bed/2 bath...
8                             <p><a  website_redacted 
9    This is a spacious four bedroom with every bed...
Name: description, dtype: object

In [35]:
df['cats_allowed'].head()

0    0
1    1
2    0
3    0
4    0
Name: cats_allowed, dtype: int64

In [37]:
df['dogs_allowed'].head()

0    0
1    1
2    0
3    0
4    0
Name: dogs_allowed, dtype: int64

In [0]:
df['total_rooms'] = df['bedrooms'] + df['bathrooms']

In [39]:
df['total_rooms'].head()

0    4.5
1    3.0
2    2.0
3    2.0
4    5.0
Name: total_rooms, dtype: float64
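
Several of the other feature ideas from the assignment can be built the same way as `total_rooms`. The sketch below is one possible approach on a made-up frame; `engineer_features` and `toy_listings` are hypothetical names. Wrapping the steps in a function may also help with ordering: `train` and `test` were sliced from `df` before `total_rooms` was added, so those slices will not contain the new column unless the split is re-run or the function is applied to each slice.

```python
import pandas as pd


def engineer_features(frame):
    """Return a copy of `frame` with a few engineered columns added."""
    out = frame.copy()

    # Treat missing or whitespace-only descriptions as empty
    description = out['description'].fillna('').str.strip()
    out['has_description'] = (description != '').astype(int)
    out['description_length'] = description.str.len()

    # Pet policy flags
    cats = out['cats_allowed'] == 1
    dogs = out['dogs_allowed'] == 1
    out['cats_or_dogs'] = (cats | dogs).astype(int)
    out['cats_and_dogs'] = (cats & dogs).astype(int)

    # Same definition as the notebook's total_rooms
    out['total_rooms'] = out['bedrooms'] + out['bathrooms']
    return out


# Toy data (hypothetical) with the same column names as the rental dataset
toy_listings = pd.DataFrame({
    'description': ['Sunny one bedroom', '   ', None],
    'cats_allowed': [1, 0, 1],
    'dogs_allowed': [0, 0, 1],
    'bedrooms': [1, 2, 3],
    'bathrooms': [1.0, 1.0, 1.5],
})
features_demo = engineer_features(toy_listings)
print(features_demo[['has_description', 'description_length',
                     'cats_or_dogs', 'cats_and_dogs', 'total_rooms']])
```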

In [40]:
df['bedrooms']/df['bathrooms']

0        2.000000
1        2.000000
2        1.000000
3        1.000000
4        4.000000
5        2.000000
6        2.000000
7        0.500000
8        1.000000
9        2.000000
10       0.000000
11       1.000000
12       2.000000
13       1.000000
14       1.000000
15       0.000000
16       2.000000
17       1.000000
18       4.000000
19       1.000000
20       3.000000
21       2.000000
22       1.000000
23       1.000000
24       2.000000
25       1.142857
26       1.000000
27       1.000000
29       1.000000
30       0.000000
           ...   
49322    1.000000
49323    1.000000
49324    0.000000
49325    2.000000
49326    1.000000
49327    1.000000
49328    1.000000
49329    2.000000
49330    1.000000
49331    0.000000
49332    2.000000
49333    1.000000
49334    1.000000
49335    1.000000
49336    1.500000
49337    0.000000
49338    1.500000
49339    3.000000
49340    1.000000
49341    0.000000
49342    1.000000
49343    1.000000
49344    1.000000
49345    1.000000
49346    1
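
One caveat with the beds-to-baths ratio: if any listing reports zero bathrooms, plain division yields `inf` (or `NaN` for 0/0), and scikit-learn's `LinearRegression` rejects both. A minimal sketch of one way to guard against that, on made-up data; `toy_rooms` and `beds_per_bath` are hypothetical names, and whether to drop or fill the resulting missing values is a modeling choice left open here.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical), including listings with zero bathrooms
toy_rooms = pd.DataFrame({
    'bedrooms': [2, 0, 1, 0],
    'bathrooms': [1.0, 1.0, 0.0, 0.0],
})

# Replace zero bathrooms with NaN so the ratio is missing rather than infinite
safe_baths = toy_rooms['bathrooms'].replace(0, np.nan)
toy_rooms['beds_per_bath'] = toy_rooms['bedrooms'] / safe_baths

print(toy_rooms)
```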

#### Fit a linear regression model with at least two features.

In [54]:
# 1. Import the appropriate estimator class from Scikit-Learn
from sklearn.linear_model import LinearRegression


# 2. Instantiate this class
model = LinearRegression()

# 3. Arrange X feature matrices
features = ['bedrooms','bathrooms']
X_train = train[features]
X_test = test[features]


# Arrange y target vectors
target = ['price']
y_train = train[target]
y_test = test[target]

print(f'Linear Regression, dependent on: {features} to predict {target}' )


# 4. Fit the model
model.fit(X_train,y_train)


Linear Regression, dependent on: ['bedrooms', 'bathrooms'] to predict ['price']


LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

#### Get the model's coefficients and intercept.

In [45]:
model.coef_,model.intercept_

(array([[ 389.3248959 , 2072.61011639]]), array([485.71869002]))

In [49]:
beta0 = model.intercept_
beta1 = model.coef_[0,0]
beta2 = model.coef_[0,1]
print(f'y = {beta0} + {beta1}x1 + {beta2}x2')

y = [485.71869002] + 389.32489590255824x1 + 2072.6101163851895x2


In [52]:
# This is easier to read
print('Intercept', model.intercept_)

print('coefficients :' ,model.coef_)

Intercept [485.71869002]
coefficients : [[ 389.3248959  2072.61011639]]
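
To make the fitted equation concrete, the prediction for a single apartment can be reproduced by hand. The sketch below copies the intercept and coefficients from the output above (in the order of `features`: bedrooms first, then bathrooms); `predict_rent` is a hypothetical helper, not something the notebook defines.

```python
# Values copied from the model output above
intercept = 485.71869002
coef_bedrooms = 389.3248959
coef_bathrooms = 2072.61011639


def predict_rent(bedrooms, bathrooms):
    """Evaluate y = intercept + coef_bedrooms * x1 + coef_bathrooms * x2."""
    return intercept + coef_bedrooms * bedrooms + coef_bathrooms * bathrooms


# A 2-bedroom, 1-bathroom apartment
example_rent = predict_rent(2, 1)
print(f'Predicted rent: ${example_rent:.2f}')
```

Read this way, each additional bedroom adds about $389 to the predicted rent and each additional bathroom about $2,073, holding the other feature fixed.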


#### Get regression metrics RMSE, MAE, and $R^2$, for both the train and test data.

In [62]:
from sklearn.metrics import mean_squared_error
from math import sqrt

y_train_pred = model.predict(X_train)
train_mse = mean_squared_error(y_train,y_train_pred)
train_rmse = sqrt(train_mse)
print(f'Train RMSE: {train_rmse:.2f}')

# Apply the model to new data
y_test_pred = model.predict(X_test)
test_mse = mean_squared_error(y_test,y_test_pred)
test_rmse = sqrt(test_mse)
print(f'Test RMSE: {test_rmse:.2f}')

Train RMSE: 1232.02
Test RMSE: 1219.72


In [63]:
from sklearn.metrics import mean_absolute_error

y_train_pred = model.predict(X_train)
train_mae = mean_absolute_error(y_train,y_train_pred)
print(f'Train MAE: {train_mae:.2f} dollars')

# Apply the model to new data
y_test_pred = model.predict(X_test)
test_mae = mean_absolute_error(y_test,y_test_pred)
print(f'Test MAE: {test_mae:.2f} dollars')

Train MAE: 818.53 dollars
Test MAE: 825.90 dollars


In [65]:
from sklearn.metrics import r2_score

y_train_pred = model.predict(X_train)
train_r2 = r2_score(y_train,y_train_pred)
print(f'Train R2 Score: {train_r2:.2f} ')

# Apply the model to new data
y_test_pred = model.predict(X_test)
test_r2 = r2_score(y_test,y_test_pred)
print(f'Test R2 Score: {test_r2:.2f} ')

Train R2 Score: 0.51 
Test R2 Score: 0.52 
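
For reference, the three metrics can be written out directly from their definitions, which also shows why predicting the mean for every row scores an $R^2$ of 0. This is a minimal NumPy sketch; `regression_metrics` is a hypothetical helper, the true prices come from the `df.head()` output earlier, and the predictions are made up for illustration.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Compute RMSE, MAE, and R^2 from their definitions."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    errors = y_true - y_pred

    rmse = np.sqrt(np.mean(errors ** 2))
    mae = np.mean(np.abs(errors))

    # R^2 compares the model's squared error to that of always predicting the mean
    ss_res = np.sum(errors ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1 - ss_res / ss_tot
    return {'rmse': rmse, 'mae': mae, 'r2': r2}


y_true = [3000, 5465, 2850, 3275, 3350]   # prices from df.head()
y_pred = [3500, 5000, 3000, 3000, 3500]   # made-up predictions

model_scores = regression_metrics(y_true, y_pred)
baseline_scores = regression_metrics(y_true, [np.mean(y_true)] * len(y_true))

print({name: round(value, 2) for name, value in model_scores.items()})
print({name: round(value, 2) for name, value in baseline_scores.items()})
```

RMSE is never smaller than MAE because squaring weights large errors more heavily, which matches the gap between the two in the results above.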
