Lambda School Data Science

*Unit 2, Sprint 1, Module 2*

---

# Regression 2

## Assignment

You'll continue to **predict how much it costs to rent an apartment in NYC,** using the dataset from renthop.com.

- [ ] Do train/test split. Use data from April & May 2016 to train. Use data from June 2016 to test.
- [ ] Engineer at least two new features. (See below for explanation & ideas.)
- [ ] Fit a linear regression model with at least two features.
- [ ] Get the model's coefficients and intercept.
- [ ] Get regression metrics RMSE, MAE, and $R^2$, for both the train and test data.
- [ ] What's the best test MAE you can get? Share your score and features used with your cohort on Slack!
- [ ] As always, commit your notebook to your fork of the GitHub repo.


#### [Feature Engineering](https://en.wikipedia.org/wiki/Feature_engineering)

> "Some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used." — Pedro Domingos, ["A Few Useful Things to Know about Machine Learning"](https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf)

> "Coming up with features is difficult, time-consuming, requires expert knowledge. 'Applied machine learning' is basically feature engineering." — Andrew Ng, [Machine Learning and AI via Brain simulations](https://forum.stanford.edu/events/2011/2011slides/plenary/2011plenaryNg.pdf) 

> Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. 

#### Feature Ideas
- Does the apartment have a description?
- How long is the description?
- How many total perks does each apartment have?
- Are cats _or_ dogs allowed?
- Are cats _and_ dogs allowed?
- Total number of rooms (beds + baths)
- Ratio of beds to baths
- What's the neighborhood, based on address or latitude & longitude?
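
Several of the ideas above could be sketched as follows. This is a rough illustration on a small made-up frame (`toy` is a hypothetical name, and the values are invented); only the column names `description`, `bedrooms`, `bathrooms`, `cats_allowed`, and `dogs_allowed` come from the renthop data. The new column names are assumptions chosen for readability.

```python
import pandas as pd

# Toy listings (made-up values) using column names from the renthop data
toy = pd.DataFrame({
    'description': ['Sunny 2BR near park', None, '   '],
    'bedrooms': [2, 1, 0],
    'bathrooms': [1.0, 1.0, 1.0],
    'cats_allowed': [1, 0, 1],
    'dogs_allowed': [1, 0, 0],
})

# Treat missing or whitespace-only descriptions as empty
desc = toy['description'].fillna('').str.strip()
toy['has_description'] = (desc != '').astype(int)
toy['description_length'] = desc.str.len()

# Pet policy flags
cats = toy['cats_allowed'] == 1
dogs = toy['dogs_allowed'] == 1
toy['cats_or_dogs'] = (cats | dogs).astype(int)
toy['cats_and_dogs'] = (cats & dogs).astype(int)

# Room counts; real listings may have 0 bathrooms, which this toy frame avoids
toy['total_rooms'] = toy['bedrooms'] + toy['bathrooms']
toy['beds_per_bath'] = toy['bedrooms'] / toy['bathrooms']

print(toy.drop(columns='description'))
```

On the real data, the ratio would need a decision about listings with zero bathrooms before it could be used as a model input.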

## Stretch Goals
- [ ] If you want more math, skim [_An Introduction to Statistical Learning_](http://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf), Chapter 3.1, Simple Linear Regression, & Chapter 3.2, Multiple Linear Regression
- [ ] If you want more introduction, watch [Brandon Foltz, Statistics 101: Simple Linear Regression](https://www.youtube.com/watch?v=ZkjP5RJLQF4) (20 minutes, over 1 million views)
- [ ] Add your own stretch goal(s)!

In [1]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Applied-Modeling/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'
    
# Ignore this Numpy warning when using Plotly Express:
# FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning, module='numpy')

In [2]:
import numpy as np
import pandas as pd

# Read New York City apartment rental listing data
df = pd.read_csv(DATA_PATH+'apartments/renthop-nyc.csv')
assert df.shape == (49352, 34)

# Remove the most extreme 1% prices,
# the most extreme .1% latitudes, &
# the most extreme .1% longitudes
df = df[(df['price'] >= np.percentile(df['price'], 0.5)) & 
        (df['price'] <= np.percentile(df['price'], 99.5)) & 
        (df['latitude'] >= np.percentile(df['latitude'], 0.05)) & 
        (df['latitude'] < np.percentile(df['latitude'], 99.95)) &
        (df['longitude'] >= np.percentile(df['longitude'], 0.05)) & 
        (df['longitude'] <= np.percentile(df['longitude'], 99.95))]

In [56]:
# Other imports
from sklearn.metrics import mean_absolute_error
import plotly.express as px

### **Engineer at least two new features. (See below for explanation & ideas.)**

In [3]:
# Look at the data
df

Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space
0,1.5,3,2016-06-24 07:54:24,A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ...,Metropolitan Avenue,40.7145,-73.9425,3000,792 Metropolitan Avenue,medium,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,1.0,2,2016-06-12 12:19:27,,Columbus Avenue,40.7947,-73.9667,5465,808 Columbus Avenue,low,1,1,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,1.0,1,2016-04-17 03:26:41,"Top Top West Village location, beautiful Pre-w...",W 13 Street,40.7388,-74.0018,2850,241 W 13 Street,high,0,0,1,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,1.0,1,2016-04-18 02:22:02,Building Amenities - Garage - Garden - fitness...,East 49th Street,40.7539,-73.9677,3275,333 East 49th Street,low,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,1.0,4,2016-04-28 01:32:41,Beautifully renovated 3 bedroom flex 4 bedroom...,West 143rd Street,40.8241,-73.9493,3350,500 West 143rd Street,low,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
49347,1.0,2,2016-06-02 05:41:05,"30TH/3RD, MASSIVE CONV 2BR IN LUXURY FULL SERV...",E 30 St,40.7426,-73.9790,3200,230 E 30 St,medium,1,0,1,0,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
49348,1.0,1,2016-04-04 18:22:34,"HIGH END condo finishes, swimming pool, and ki...",Rector Pl,40.7102,-74.0163,3950,225 Rector Place,low,1,1,0,1,1,0,0,1,1,0,0,0,1,0,0,0,0,0,0,1,0,0,0,1
49349,1.0,1,2016-04-16 02:13:40,Large Renovated One Bedroom Apartment with Sta...,West 45th Street,40.7601,-73.9900,2595,341 West 45th Street,low,1,1,0,1,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
49350,1.0,0,2016-04-08 02:13:33,Stylishly sleek studio apartment with unsurpas...,Wall Street,40.7066,-74.0101,3350,37 Wall Street,low,1,1,0,1,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0


My new features are:

1. Number of perks (sum of the building amenity flags)
2. Number of rooms (sum of bedrooms, bathrooms, and dining room)

In [4]:
# Number of features (perks)
df['perks'] = df[['elevator', 'cats_allowed', 'hardwood_floors', 'dogs_allowed',
           'doorman', 'dishwasher', 'no_fee', 'laundry_in_building',
           'fitness_center', 'pre-war', 'laundry_in_unit', 'roof_deck',
           'outdoor_space', 'high_speed_internet', 'balcony',
           'swimming_pool', 'new_construction', 'terrace', 'exclusive', 'loft',
           'garden_patio', 'wheelchair_access', 'common_outdoor_space']].sum(axis=1)

In [5]:
# Number of rooms
df['rooms'] = df[['bedrooms','bathrooms','dining_room']].sum(axis=1)

In [6]:
# Confirm
df.sample()

Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space,perks,rooms
41863,1.0,1,2016-04-24 04:11:43,Industrial finishes throughout - all apartment...,Desbrosses,40.7236,-74.0106,4250,34 Desbrosses,low,1,0,1,0,1,1,1,0,1,0,1,1,0,0,1,1,0,1,0,0,0,1,0,0,12,2.0


In [7]:
# At this point I decided to trim the dataframe to the features I am most likely to use
NYCA = df.filter(items=['created','price','rooms','perks'])

### **Do train/test split. Use data from April & May 2016 to train. Use data from June 2016 to test.**

In [8]:
# First I look at the data
NYCA

Unnamed: 0,created,price,rooms,perks
0,2016-06-24 07:54:24,3000,4.5,0
1,2016-06-12 12:19:27,5465,3.0,5
2,2016-04-17 03:26:41,2850,2.0,3
3,2016-04-18 02:22:02,3275,2.0,2
4,2016-04-28 01:32:41,3350,5.0,1
...,...,...,...,...
49347,2016-06-02 05:41:05,3200,3.0,5
49348,2016-04-04 18:22:34,3950,2.0,9
49349,2016-04-16 02:13:40,2595,2.0,5
49350,2016-04-08 02:13:33,3350,1.0,5


In [9]:
# I wanted to know if the 'created' column was in datetime
# Since it isn't I decided to change that to make it easier to split
NYCA['created'].dtype

dtype('O')

In [10]:
NYCA = NYCA.assign(created=pd.to_datetime(NYCA['created']))  # store the converted column
NYCA['created']

0       2016-06-24 07:54:24
1       2016-06-12 12:19:27
2       2016-04-17 03:26:41
3       2016-04-18 02:22:02
4       2016-04-28 01:32:41
                ...        
49347   2016-06-02 05:41:05
49348   2016-04-04 18:22:34
49349   2016-04-16 02:13:40
49350   2016-04-08 02:13:33
49351   2016-04-12 02:48:07
Name: created, Length: 48817, dtype: datetime64[ns]

In [11]:
# Create train and test subsets by listing date,
# then check the shapes to confirm no rows were lost
train = NYCA[NYCA['created'] < '2016-06-01 00:00:00']
test = NYCA[NYCA['created'] > '2016-05-31 23:59:59']
print('The shape of the "train" subset is', train.shape,
      '\nthe shape of test subset is', test.shape,
      '\nthe overall shape of the dataset is', df.shape, 
      '\nthe number of rows of the two sets together equal', len(train) + len(test),
      '. The split did not lose data')

The shape of the "train" subset is (31844, 4) 
the shape of test subset is (16973, 4) 
the overall shape of the dataset is (48817, 36) 
the number of rows of the two sets together equal 48817 . The split did not lose data


In [12]:
# Double-check the date range: sort by date, then show the first and last rows
train = train.set_index(train['created']).sort_index()
pd.concat([train.head(1), train.tail(1)])

created (index),created,price,rooms,perks
2016-04-01 22:12:41,2016-04-01 22:12:41,3195,2.0,3
2016-05-31 23:10:48,2016-05-31 23:10:48,3095,2.0,4


In [13]:
test = test.set_index(test['created']).sort_index()
pd.concat([test.head(1), test.tail(1)])

created (index),created,price,rooms,perks
2016-06-01 01:10:37,2016-06-01 01:10:37,2150,2.0,1
2016-06-29 21:41:47,2016-06-29 21:41:47,3399,4.0,0


In [14]:
# Drop the datetime index (the 'created' column itself is kept)
train.reset_index(drop=True, inplace=True)
test.reset_index(drop=True, inplace=True)

In [15]:
# Check
pd.concat([train.head(1), test.head(1)])

Unnamed: 0,created,price,rooms,perks
0,2016-04-01 22:12:41,3195,2.0,3
0,2016-06-01 01:10:37,2150,2.0,1


### **Fit a linear regression model with at least two features.**

In [20]:
# Begin with a mean baseline (the mean here is taken over the full dataset, train and test combined)
aguess = round(df['price'].mean())
errs = aguess - df['price']
mae = errs.abs().mean()
print(f'If I guess that the average rent is ${aguess},')
print(f'I would be off by ${mae:,.0f} on average.')

If I guess that the average rent is $3580,
I would be off by $1,202 on average.


In [42]:
# Y target vectors
y_train = train['price']
y_test = test['price']

In [44]:
# Wrote a function rather than repeat actions
def finderr(guess, data):
    """Print the MAE of predicting the constant `guess` for every row in `data`."""
    pred = [guess] * len(data)
    mae = mean_absolute_error(data, pred)
    print(f'On average, the error is: {mae:,.0f}')


In [54]:
print('Y-Train Error:')
finderr(aguess,y_train)

Y-Train Error:
On average, the error is: 1,203


In [55]:
print('Y-Test Error:')
finderr(aguess,y_test)

Y-Test Error:
On average, the error is: 1,199


In [57]:
# Scatterplot of two features and target
px.scatter_3d(train,
    x='perks', 
    y='rooms', 
    z='price', 
    text='created', 
    title='New York City Apartment Listings, April-May 2016')

### **Get the model's coefficients and intercept.**
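
A minimal sketch of how the coefficients and intercept could be read off a fitted scikit-learn `LinearRegression`. The data here is made up (`X_toy` and `y_toy` are hypothetical names), with prices generated from a known rule so the recovered values are easy to check. These are not results from the renthop data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up feature matrix: first column is rooms, second column is perks
X_toy = np.array([[1.0, 0], [2.0, 3], [3.0, 1], [4.0, 5], [2.0, 2]])

# Prices built from a known rule: 1000 + 500 * rooms + 100 * perks
y_toy = 1000 + 500 * X_toy[:, 0] + 100 * X_toy[:, 1]

model = LinearRegression()
model.fit(X_toy, y_toy)

# intercept_ is a single number; coef_ has one entry per feature column
intercept = model.intercept_
rooms_coef, perks_coef = model.coef_

print(f'Intercept: {intercept:,.0f}')
print(f'Rooms coefficient: {rooms_coef:,.0f}')
print(f'Perks coefficient: {perks_coef:,.0f}')
```

In the notebook, the same pattern would apply with `train[['rooms', 'perks']]` as the feature matrix and `y_train` as the target. Each coefficient is then read as the change in predicted rent for a one-unit increase in that feature, holding the other fixed.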

### **Get regression metrics RMSE, MAE, and $R^2$, for both the train and test data.**
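
A minimal sketch of how these three metrics could be computed with scikit-learn. The actual and predicted rents below are made up for illustration (`y_true` and `y_pred` are hypothetical names), so the printed numbers say nothing about the renthop model.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up actual and predicted rents, for illustration only
y_true = np.array([3000, 2500, 4000, 3500])
y_pred = np.array([2800, 2700, 4100, 3300])

mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # square root of the mean squared error
r2 = r2_score(y_true, y_pred)

print(f'MAE:  ${mae:,.0f}')
print(f'RMSE: ${rmse:,.0f}')
print(f'R^2:  {r2:.3f}')
```

For the assignment, the same three calls would be run twice: once with `y_train` and the model's predictions on the training features, and once with `y_test` and its predictions on the test features.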

### **What's the best test MAE you can get?**