Lambda School Data Science

*Unit 2, Sprint 1, Module 2*

---

# Regression 2

## Assignment

You'll continue to **predict how much it costs to rent an apartment in NYC,** using the dataset from renthop.com.

* \[ \] Do train/test split. Use data from April & May 2016 to train. Use data from June 2016 to test.

* \[ \] Engineer at least two new features. (See below for explanation & ideas.)

* \[ \] Fit a linear regression model with at least two features.

* \[ \] Get the model's coefficients and intercept.

* \[ \] Get regression metrics RMSE, MAE, and $R^2$, for both the train and test data.

* \[ \] What's the best test MAE you can get? Share your score and features used with your cohort on Slack!

* \[ \] As always, commit your notebook to your fork of the GitHub repo.

#### [Feature Engineering](https://en.wikipedia.org/wiki/Feature_engineering)

> "Some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used." — Pedro Domingos, ["A Few Useful Things to Know about Machine Learning"](https://homes.cs.washington.edu/\~pedrod/papers/cacm12.pdf)

> "Coming up with features is difficult, time-consuming, requires expert knowledge. 'Applied machine learning' is basically feature engineering." — Andrew Ng, [Machine Learning and AI via Brain simulations](https://forum.stanford.edu/events/2011/2011slides/plenary/2011plenaryNg.pdf)

> Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work.

#### Feature Ideas

* Does the apartment have a description?

* How long is the description?

* How many total perks does each apartment have?

* Are cats *or* dogs allowed?

* Are cats *and* dogs allowed?

* Total number of rooms (beds + baths)

* Ratio of beds to baths

* What's the neighborhood, based on address or latitude & longitude?
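
A few of the ideas above can be sketched with plain pandas operations. This is only an illustration on a tiny made-up frame that borrows the dataset's column names; the new column names (`has_description`, `description_length`, `cats_or_dogs`, `cats_and_dogs`, `beds_per_bath`) are assumptions for this sketch, not part of the original data.

```python
import numpy as np
import pandas as pd

# Made-up rows that reuse column names from the renthop data
df = pd.DataFrame({
    'description': ['Sunny 2BR near park', None, '   '],
    'bedrooms': [2, 0, 1],
    'bathrooms': [1.0, 1.0, 0.0],
    'cats_allowed': [1, 0, 1],
    'dogs_allowed': [0, 0, 1],
})

# Treat missing or whitespace-only descriptions as "no description"
desc = df['description'].fillna('').str.strip()
df['has_description'] = (desc != '').astype(int)

# Description length in characters (after stripping whitespace)
df['description_length'] = desc.str.len()

# Cats OR dogs allowed, and cats AND dogs allowed
cats = df['cats_allowed'] == 1
dogs = df['dogs_allowed'] == 1
df['cats_or_dogs'] = (cats | dogs).astype(int)
df['cats_and_dogs'] = (cats & dogs).astype(int)

# Beds-to-baths ratio; zero baths becomes NaN instead of dividing by zero
df['beds_per_bath'] = df['bedrooms'] / df['bathrooms'].replace(0, np.nan)

print(df.drop(columns='description'))
```

Note that the `NaN` ratios would need to be filled or dropped before fitting, since scikit-learn's `LinearRegression` does not accept missing values.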

## Stretch Goals

* \[ \] If you want more math, skim *[An Introduction to Statistical Learning](http://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf)*,  Chapter 3.1, Simple Linear Regression, & Chapter 3.2, Multiple Linear Regression

* \[ \] If you want more introduction, watch [Brandon Foltz, Statistics 101: Simple Linear Regression](https://www.youtube.com/watch?v=ZkjP5RJLQF4)
  (20 minutes, over 1 million views)

* \[ \] Add your own stretch goal(s) !

In [1]:
%%capture
import sys

DATA_PATH = 'https://dsets.scurlock.me/DS13U2/apartments/renthop-nyc.csv'
    
# Ignore this Numpy warning when using Plotly Express:
# FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning, module='numpy')

In [2]:
import numpy as np
import pandas as pd

# Read New York City apartment rental listing data
df = pd.read_csv(DATA_PATH, infer_datetime_format=True)
assert df.shape == (49352, 34)

# Remove the most extreme 1% prices,
# the most extreme .1% latitudes, &
# the most extreme .1% longitudes
df = df[(df['price'] >= np.percentile(df['price'], 0.5)) & 
        (df['price'] <= np.percentile(df['price'], 99.5)) & 
        (df['latitude'] >= np.percentile(df['latitude'], 0.05)) & 
        (df['latitude'] < np.percentile(df['latitude'], 99.95)) &
        (df['longitude'] >= np.percentile(df['longitude'], 0.05)) & 
        (df['longitude'] <= np.percentile(df['longitude'], 99.95))]

In [3]:
df.head()

Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,...,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space
0,1.5,3,2016-06-24 07:54:24,A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ...,Metropolitan Avenue,40.7145,-73.9425,3000,792 Metropolitan Avenue,medium,...,0,0,0,0,0,0,0,0,0,0
1,1.0,2,2016-06-12 12:19:27,,Columbus Avenue,40.7947,-73.9667,5465,808 Columbus Avenue,low,...,0,0,0,0,0,0,0,0,0,0
2,1.0,1,2016-04-17 03:26:41,"Top Top West Village location, beautiful Pre-w...",W 13 Street,40.7388,-74.0018,2850,241 W 13 Street,high,...,0,0,0,0,0,0,0,0,0,0
3,1.0,1,2016-04-18 02:22:02,Building Amenities - Garage - Garden - fitness...,East 49th Street,40.7539,-73.9677,3275,333 East 49th Street,low,...,0,0,0,0,0,0,0,0,0,0
4,1.0,4,2016-04-28 01:32:41,Beautifully renovated 3 bedroom flex 4 bedroom...,West 143rd Street,40.8241,-73.9493,3350,500 West 143rd Street,low,...,0,0,0,0,0,0,0,0,0,0


In [4]:
df.tail()

Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,...,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space
49347,1.0,2,2016-06-02 05:41:05,"30TH/3RD, MASSIVE CONV 2BR IN LUXURY FULL SERV...",E 30 St,40.7426,-73.979,3200,230 E 30 St,medium,...,0,0,0,0,0,0,0,0,0,0
49348,1.0,1,2016-04-04 18:22:34,"HIGH END condo finishes, swimming pool, and ki...",Rector Pl,40.7102,-74.0163,3950,225 Rector Place,low,...,0,0,0,0,0,1,0,0,0,1
49349,1.0,1,2016-04-16 02:13:40,Large Renovated One Bedroom Apartment with Sta...,West 45th Street,40.7601,-73.99,2595,341 West 45th Street,low,...,0,0,0,0,0,0,0,0,0,0
49350,1.0,0,2016-04-08 02:13:33,Stylishly sleek studio apartment with unsurpas...,Wall Street,40.7066,-74.0101,3350,37 Wall Street,low,...,0,0,0,0,0,0,0,0,0,0
49351,1.0,2,2016-04-12 02:48:07,Look no further!!! This giant 2 bedroom apart...,Park Terrace East,40.8699,-73.9172,2200,30 Park Terrace East,low,...,0,0,0,0,0,0,0,0,0,0


In [5]:
df.dtypes

bathrooms               float64
bedrooms                  int64
created                  object
description              object
display_address          object
latitude                float64
longitude               float64
price                     int64
street_address           object
interest_level           object
elevator                  int64
cats_allowed              int64
hardwood_floors           int64
dogs_allowed              int64
doorman                   int64
dishwasher                int64
no_fee                    int64
laundry_in_building       int64
fitness_center            int64
pre-war                   int64
laundry_in_unit           int64
roof_deck                 int64
outdoor_space             int64
dining_room               int64
high_speed_internet       int64
balcony                   int64
swimming_pool             int64
new_construction          int64
terrace                   int64
exclusive                 int64
loft                      int64
garden_p

In [6]:
# First thing I want to do is deal with the date column,
# getting it into the right format

df['created'] = df['created'].astype('datetime64[ns]')

In [7]:
df.dtypes

bathrooms                      float64
bedrooms                         int64
created                 datetime64[ns]
description                     object
display_address                 object
latitude                       float64
longitude                      float64
price                            int64
street_address                  object
interest_level                  object
elevator                         int64
cats_allowed                     int64
hardwood_floors                  int64
dogs_allowed                     int64
doorman                          int64
dishwasher                       int64
no_fee                           int64
laundry_in_building              int64
fitness_center                   int64
pre-war                          int64
laundry_in_unit                  int64
roof_deck                        int64
outdoor_space                    int64
dining_room                      int64
high_speed_internet              int64
balcony                  

In [13]:
df.columns

Index(['bathrooms', 'bedrooms', 'created', 'description', 'display_address',
       'latitude', 'longitude', 'price', 'street_address', 'interest_level',
       'elevator', 'cats_allowed', 'hardwood_floors', 'dogs_allowed',
       'doorman', 'dishwasher', 'no_fee', 'laundry_in_building',
       'fitness_center', 'pre-war', 'laundry_in_unit', 'roof_deck',
       'outdoor_space', 'dining_room', 'high_speed_internet', 'balcony',
       'swimming_pool', 'new_construction', 'terrace', 'exclusive', 'loft',
       'garden_patio', 'wheelchair_access', 'common_outdoor_space', 'year',
       'month_name', 'month', 'day', 'day_name', 'total_rooms'],
      dtype='object')

In [8]:
df.head()

Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,...,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space
0,1.5,3,2016-06-24 07:54:24,A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ...,Metropolitan Avenue,40.7145,-73.9425,3000,792 Metropolitan Avenue,medium,...,0,0,0,0,0,0,0,0,0,0
1,1.0,2,2016-06-12 12:19:27,,Columbus Avenue,40.7947,-73.9667,5465,808 Columbus Avenue,low,...,0,0,0,0,0,0,0,0,0,0
2,1.0,1,2016-04-17 03:26:41,"Top Top West Village location, beautiful Pre-w...",W 13 Street,40.7388,-74.0018,2850,241 W 13 Street,high,...,0,0,0,0,0,0,0,0,0,0
3,1.0,1,2016-04-18 02:22:02,Building Amenities - Garage - Garden - fitness...,East 49th Street,40.7539,-73.9677,3275,333 East 49th Street,low,...,0,0,0,0,0,0,0,0,0,0
4,1.0,4,2016-04-28 01:32:41,Beautifully renovated 3 bedroom flex 4 bedroom...,West 143rd Street,40.8241,-73.9493,3350,500 West 143rd Street,low,...,0,0,0,0,0,0,0,0,0,0


### 1. Engineer at least two new features.

I am going to create some features I may need later. I may not need them all, but I will get them out of the way now. The two new features are total rooms and the geocoded neighborhood (if the geocoding gets done in time). I will add a few more features here just in case.

In [14]:
# Year
df['year'] = df['created'].dt.year

# Month name
df['month_name'] = df['created'].dt.month_name()

# Month
df['month'] = df['created'].dt.month

# Day
df['day'] = df['created'].dt.day

# Day Name
df['day_name'] = df['created'].dt.day_name()

# Total rooms (vectorized column addition, no row-wise apply needed)
df['total_rooms'] = df['bathrooms'] + df['bedrooms']

# Number of perks

perks = ['elevator', 'cats_allowed', 'hardwood_floors', 'dogs_allowed', 
         'doorman', 'dishwasher', 'no_fee', 'laundry_in_building', 
         'fitness_center', 'pre-war', 'laundry_in_unit', 'roof_deck',
         'outdoor_space', 'dining_room', 'high_speed_internet', 'balcony', 
         'swimming_pool', 'new_construction', 'terrace', 'exclusive', 'loft', 
         'garden_patio', 'wheelchair_access', 'common_outdoor_space']

# Each perk column is a 0/1 flag, so the row-wise sum is the perk count
df['total_perks'] = df[perks].sum(axis=1)

### Finding Neighborhoods based on lon and lat

I found a GeoJSON file that contains that information. I will search it and add a feature for neighborhood.

In [10]:
# (see src/geocoder.py) 
# This is being done outside of the notebook 
# using a geojson file and geopandas, but is 
# taking a lot of time, so I will move on until
# it is finished, and if I have enough time, I
# will try to


In [15]:
# Take a peek
df.head()

Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,...,garden_patio,wheelchair_access,common_outdoor_space,year,month_name,month,day,day_name,total_rooms,total_perks
0,1.5,3,2016-06-24 07:54:24,A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ...,Metropolitan Avenue,40.7145,-73.9425,3000,792 Metropolitan Avenue,medium,...,0,0,0,2016,June,6,24,Friday,4.5,0
1,1.0,2,2016-06-12 12:19:27,,Columbus Avenue,40.7947,-73.9667,5465,808 Columbus Avenue,low,...,0,0,0,2016,June,6,12,Sunday,3.0,5
2,1.0,1,2016-04-17 03:26:41,"Top Top West Village location, beautiful Pre-w...",W 13 Street,40.7388,-74.0018,2850,241 W 13 Street,high,...,0,0,0,2016,April,4,17,Sunday,2.0,3
3,1.0,1,2016-04-18 02:22:02,Building Amenities - Garage - Garden - fitness...,East 49th Street,40.7539,-73.9677,3275,333 East 49th Street,low,...,0,0,0,2016,April,4,18,Monday,2.0,2
4,1.0,4,2016-04-28 01:32:41,Beautifully renovated 3 bedroom flex 4 bedroom...,West 143rd Street,40.8241,-73.9493,3350,500 West 143rd Street,low,...,0,0,0,2016,April,4,28,Thursday,5.0,1


In [32]:
# Did it get the dtypes right?
gdf.dtypes

ntacode          object
shape_area       object
county_fips      object
ntaname          object
shape_leng       object
boro_name        object
boro_code        object
geometry       geometry
dtype: object

### 2. Do train/test split. Use data from April & May 2016 to train. Use data from June 2016 to test.

In [39]:
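# April or May, and year 2016. isin() avoids mixing ^ and &,
# where & binds tighter and the year check would only apply to May.
train = df[df['month_name'].isin(['April', 'May']) & 
           (df['year'] == 2016)]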

test = df[(df['month_name'] == 'June') & 
          (df['year'] == 2016)]
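
An alternative to filtering on month names is to compare the `created` timestamps against a cutoff date. The following is a minimal sketch on a few made-up rows, assuming the data contains only April through June 2016 listings; the `cutoff` name is only for illustration.

```python
import pandas as pd

# Made-up listings around the May/June boundary
df = pd.DataFrame({
    'created': pd.to_datetime(['2016-04-17 03:26:41', '2016-05-31 23:59:59',
                               '2016-06-01 00:00:00', '2016-06-24 07:54:24']),
    'price': [2850, 3275, 3050, 3000],
})

# Everything before June 1 trains the model; June onward tests it
cutoff = pd.Timestamp('2016-06-01')
train = df[df['created'] < cutoff]
test = df[df['created'] >= cutoff]

# Sanity checks: the two sets cover every row and do not overlap in time
assert len(train) + len(test) == len(df)
assert train['created'].max() < test['created'].min()
print(train.shape, test.shape)
```

Splitting by time rather than at random keeps the test set strictly later than the training set, which mirrors how the model would be used on future listings.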

In [40]:
# Look at train
train.head()

Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,...,garden_patio,wheelchair_access,common_outdoor_space,year,month_name,month,day,day_name,total_rooms,total_perks
2,1.0,1,2016-04-17 03:26:41,"Top Top West Village location, beautiful Pre-w...",W 13 Street,40.7388,-74.0018,2850,241 W 13 Street,high,...,0,0,0,2016,April,4,17,Sunday,2.0,3
3,1.0,1,2016-04-18 02:22:02,Building Amenities - Garage - Garden - fitness...,East 49th Street,40.7539,-73.9677,3275,333 East 49th Street,low,...,0,0,0,2016,April,4,18,Monday,2.0,2
4,1.0,4,2016-04-28 01:32:41,Beautifully renovated 3 bedroom flex 4 bedroom...,West 143rd Street,40.8241,-73.9493,3350,500 West 143rd Street,low,...,0,0,0,2016,April,4,28,Thursday,5.0,1
5,2.0,4,2016-04-19 04:24:47,,West 18th Street,40.7429,-74.0028,7995,350 West 18th Street,medium,...,0,0,0,2016,April,4,19,Tuesday,6.0,0
6,1.0,2,2016-04-27 03:19:56,Stunning unit with a great location and lots o...,West 107th Street,40.8012,-73.966,3600,210 West 107th Street,low,...,0,0,0,2016,April,4,27,Wednesday,3.0,3


In [41]:
# Look at test
test.head()

Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,...,garden_patio,wheelchair_access,common_outdoor_space,year,month_name,month,day,day_name,total_rooms,total_perks
0,1.5,3,2016-06-24 07:54:24,A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ...,Metropolitan Avenue,40.7145,-73.9425,3000,792 Metropolitan Avenue,medium,...,0,0,0,2016,June,6,24,Friday,4.5,0
1,1.0,2,2016-06-12 12:19:27,,Columbus Avenue,40.7947,-73.9667,5465,808 Columbus Avenue,low,...,0,0,0,2016,June,6,12,Sunday,3.0,5
11,1.0,1,2016-06-03 03:21:22,Check out this one bedroom apartment in a grea...,W. 173rd Street,40.8448,-73.9396,1675,644 W. 173rd Street,low,...,0,0,0,2016,June,6,3,Friday,2.0,0
14,1.0,1,2016-06-01 03:11:01,Spacious 1-Bedroom to fit King-sized bed comfo...,East 56th St..,40.7584,-73.9648,3050,315 East 56th St..,low,...,0,0,0,2016,June,6,1,Wednesday,2.0,3
24,2.0,4,2016-06-07 04:39:56,SPRAWLING 2 BEDROOM FOUND! ENJOY THE LUXURY OF...,W 18 St.,40.7391,-73.9936,7400,30 W 18 St.,medium,...,0,0,0,2016,June,6,7,Tuesday,6.0,11


### 3. Fit a linear regression model with at least two features

In [51]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error


# Features, just using two from above
# total_perks, and total_rooms
features = ['total_perks', 'total_rooms']
target = ['price']

y_train = train[target]
y_test = test[target]

# Baseline
print('Baseline with no features')
guess = y_train.mean()
print(guess)

Baseline with no features
price    3575.604007
dtype: float64


In [54]:
# Train/Test error
##################
# Train
y_pred = [guess] * len(y_train)
mae = mean_absolute_error(y_train, y_pred)
print(f'Train Error: {mae}')


Train Error: 1201.8811133682555


In [60]:
# Test 
y_pred = [guess] * len(y_test)
mae = mean_absolute_error(y_test, y_pred)
print(f'Test Error: {mae}')

Test Error: 1197.7088871089013


In [61]:
# The model
model = LinearRegression()

In [62]:
# features 
features = ['total_perks', 'total_rooms']
X_train = train[features]
X_test = test[features]

In [64]:
# Fit 
model.fit(X_train, y_train)
y_pred = model.predict(X_train)
mae = mean_absolute_error(y_train, y_pred)
print(f'Train Error: {mae:.2f}')

Train Error: 856.62


In [65]:
# Try with new data
y_pred = model.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
print(f'Test Error: {mae:.2f}')

Test Error: 868.00


### 4. Get the model's coefficients and intercept.

In [72]:
model.intercept_, model.coef_

(array([998.19463411]), array([[101.76435155, 767.95662272]]))
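
To see what these numbers mean, the fitted line can be applied by hand. This sketch copies the intercept and coefficients printed above (in the order of `features`: `total_perks`, then `total_rooms`); the helper name `predict_price` is made up for illustration.

```python
# Values copied from model.intercept_ and model.coef_ above
intercept = 998.19463411
coef_perks = 101.76435155   # dollars per additional perk
coef_rooms = 767.95662272   # dollars per additional room (beds + baths)

def predict_price(total_perks, total_rooms):
    """Linear model: intercept + coef_perks * perks + coef_rooms * rooms."""
    return intercept + coef_perks * total_perks + coef_rooms * total_rooms

# The West Village listing shown earlier: 3 perks, 2.0 rooms, actual price 2850
predicted = predict_price(3, 2.0)
print(f'Predicted: {predicted:.2f}')        # about 2839.40
print(f'Absolute error: {abs(2850 - predicted):.2f}')
```

Read this way, each extra perk adds roughly $102 to the predicted rent and each extra room roughly $768, holding the other feature fixed.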

### 5. Get regression metrics RMSE, MAE, and $R^2$, for both the train and test data.

What's the best test MAE you can get? Share your score and features used with your cohort on Slack!

##### Need to remind myself:

* **MAE** (Mean Absolute Error) is the average of the absolute differences between the actual and predicted values over the data set.

* **MSE** (Mean Squared Error) is the average of the squared differences between the actual and predicted values over the data set.

* **RMSE** (Root Mean Squared Error) is the square root of MSE, which puts the error back in the units of the target (dollars here).

* **R-squared** (Coefficient of determination) is the share of the variance in the actual values that the model explains. A value of 1 is a perfect fit and 0 is no better than always predicting the mean; it can be negative on test data when the model does worse than that. The higher the value, the better the model fits.
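
The four metrics can be written out directly from their definitions. This is a sketch with NumPy on made-up prices, not the notebook's actual results; the function name `regression_metrics` is an assumption. In the notebook itself, `sklearn.metrics` provides `mean_absolute_error`, `mean_squared_error`, and `r2_score` for the same purpose.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE and R^2 computed from their definitions."""
    # ravel() also handles single-column frames such as train[['price']]
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    errors = y_true - y_pred

    mae = np.mean(np.abs(errors))        # average absolute error
    mse = np.mean(errors ** 2)           # average squared error
    rmse = np.sqrt(mse)                  # back in the target's units

    ss_res = np.sum(errors ** 2)                       # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)     # total variation
    r2 = 1 - ss_res / ss_tot
    return {'MAE': mae, 'MSE': mse, 'RMSE': rmse, 'R2': r2}

# Made-up actual and predicted rents
y_true = [3000, 5465, 2850, 3275]
y_pred = [3200, 5000, 2900, 3300]
metrics = regression_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f'{name}: {value:.4f}')

# Always predicting the mean gives R^2 = 0 by construction
baseline = regression_metrics(y_true, [np.mean(y_true)] * len(y_true))
print(f"Baseline R2: {baseline['R2']:.4f}")
```

Calling the function once with the train predictions and once with the test predictions would give the train and test versions of each metric.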