Lambda School Data Science

*Unit 2, Sprint 1, Module 2*

---

# Regression 2

## Assignment

You'll continue to **predict how much it costs to rent an apartment in NYC,** using the dataset from renthop.com.

- [ ] Do train/test split. Use data from April & May 2016 to train. Use data from June 2016 to test.
- [ ] Engineer at least two new features. (See below for explanation & ideas.)
- [ ] Fit a linear regression model with at least two features.
- [ ] Get the model's coefficients and intercept.
- [ ] Get regression metrics RMSE, MAE, and $R^2$, for both the train and test data.
- [ ] What's the best test MAE you can get? Share your score and features used with your cohort on Slack!
- [ ] As always, commit your notebook to your fork of the GitHub repo.


### [Feature Engineering](https://en.wikipedia.org/wiki/Feature_engineering)

> "Some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used." — Pedro Domingos, ["A Few Useful Things to Know about Machine Learning"](https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf)

> "Coming up with features is difficult, time-consuming, requires expert knowledge. 'Applied machine learning' is basically feature engineering." — Andrew Ng, [Machine Learning and AI via Brain simulations](https://forum.stanford.edu/events/2011/2011slides/plenary/2011plenaryNg.pdf) 

> Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. 

### Feature Ideas
- Does the apartment have a description?
- How long is the description?
- How many total perks does each apartment have?
- Are cats _or_ dogs allowed?
- Are cats _and_ dogs allowed?
- Total number of rooms (beds + baths)
- Ratio of beds to baths
- What's the neighborhood, based on address or latitude & longitude?
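
A few of the ideas above can be sketched on a small made-up DataFrame. The column names `description`, `cats_allowed`, and `dogs_allowed` come from the dataset; the toy rows and the new column names (`description_length`, `has_description`, `cats_or_dogs`, `cats_and_dogs`) are assumptions chosen for illustration, not part of the assignment.

```python
import pandas as pd

# Toy rows (made up) that mimic three columns of the rental data
toy = pd.DataFrame({
    'description': ['Sunny 2BR near park', None, '   '],
    'cats_allowed': [1, 0, 1],
    'dogs_allowed': [0, 0, 1],
})

# Treat missing and whitespace-only descriptions as empty
desc = toy['description'].fillna('').str.strip()
toy['description_length'] = desc.str.len()
toy['has_description'] = (toy['description_length'] > 0).astype(int)

# Cats OR dogs allowed vs. cats AND dogs allowed
cats = toy['cats_allowed'] == 1
dogs = toy['dogs_allowed'] == 1
toy['cats_or_dogs'] = (cats | dogs).astype(int)
toy['cats_and_dogs'] = (cats & dogs).astype(int)

print(toy)
```

Stripping whitespace before measuring length matters here because a listing can have a description that is present but blank.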

## Stretch Goals
- [ ] If you want more math, skim [_An Introduction to Statistical Learning_](http://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf), Chapter 3.1, Simple Linear Regression, & Chapter 3.2, Multiple Linear Regression
- [ ] If you want more introduction, watch [Brandon Foltz, Statistics 101: Simple Linear Regression](https://www.youtube.com/watch?v=ZkjP5RJLQF4) (20 minutes, over 1 million views)
- [ ] Add your own stretch goal(s)!

In [0]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Applied-Modeling/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'
    
# Ignore this Numpy warning when using Plotly Express:
# FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning, module='numpy')

In [0]:
import numpy as np
import pandas as pd

# Read New York City apartment rental listing data
df = pd.read_csv(DATA_PATH+'apartments/renthop-nyc.csv')
assert df.shape == (49352, 34)

# Remove the most extreme 1% prices,
# the most extreme .1% latitudes, &
# the most extreme .1% longitudes
df = df[(df['price'] >= np.percentile(df['price'], 0.5)) & 
        (df['price'] <= np.percentile(df['price'], 99.5)) & 
        (df['latitude'] >= np.percentile(df['latitude'], 0.05)) & 
        (df['latitude'] < np.percentile(df['latitude'], 99.95)) &
        (df['longitude'] >= np.percentile(df['longitude'], 0.05)) & 
        (df['longitude'] <= np.percentile(df['longitude'], 99.95))]

In [0]:
df['created'] = pd.to_datetime(df['created'])
df.sort_values(by=['created'], inplace=True)
print(df.shape)
df.tail()

(48817, 34)


Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space
19943,1.0,2,2016-06-29 17:47:34,"Newly renovated, clean, quiet & very bright 2...",At Thompson St,40.7261,-74.0014,3045,132 Thompson St #20,medium,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
16801,1.0,1,2016-06-29 17:56:12,Historic conversion in Greenpoint with awesome...,"100 Dupont St, Brooklyn, NY 11222",40.7358,-73.956,2648,100 Dupont St,low,1,1,0,1,0,1,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
32633,1.0,1,2016-06-29 18:14:48,"Large, sunny one bedroom apartment with a sepa...",West 45th St. and 8th Ave.,40.7597,-73.989,2650,305 West 45th St.,low,1,1,0,1,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
20560,1.0,3,2016-06-29 18:30:41,LOOK NO FURTHER!!\r\rMassive three bedroom apa...,williamsburg,40.7156,-73.954,3200,32 Havemeyer Street,medium,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
17743,1.0,3,2016-06-29 21:41:47,LOOK NO FURTHER!!\r\rMassive three bedroom apa...,williamsburg,40.7156,-73.954,3399,32 Havemeyer Street,medium,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [0]:
# df.info()
df.describe(include='all')

Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space,pet_level,amenity_score,rooms
count,48817.0,48817.0,48817,47392.0,48684,48817.0,48817.0,48817.0,48807,48817,48817.0,48817.0,48817.0,48817.0,48817.0,48817.0,48817.0,48817.0,48817.0,48817.0,48817.0,48817.0,48817.0,48817.0,48817.0,48817.0,48817.0,48817.0,48817.0,48817.0,48817.0,48817.0,48817.0,48817.0,48817.0,48817.0,48817.0
unique,,,48148,37853.0,8674,,,,15135,3,,,,,,,,,,,,,,,,,,,,,,,,,,,
top,,,2016-06-05 01:32:36,,Broadway,,,,3333 Broadway,low,,,,,,,,,,,,,,,,,,,,,,,,,,,
freq,,,3,1627.0,435,,,,174,33946,,,,,,,,,,,,,,,,,,,,,,,,,,,
first,,,2016-04-01 22:12:41,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
last,,,2016-06-29 21:41:47,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
mean,1.201794,1.537149,,,,40.75076,-73.97276,3579.585247,,,0.524838,0.478276,0.478276,0.447631,0.424852,0.415081,0.367085,0.052769,0.268452,0.185653,0.175902,0.132761,0.138394,0.102833,0.087203,0.060471,0.055206,0.051908,0.046193,0.043305,0.042711,0.039331,0.027224,0.026241,1.373538,3.327693,2.738943
std,0.470711,1.106087,,,,0.038954,0.028883,1762.430772,,,0.499388,0.499533,0.499533,0.497255,0.494326,0.492741,0.482015,0.223573,0.443158,0.38883,0.380741,0.33932,0.345317,0.303744,0.282136,0.238359,0.228385,0.221844,0.209905,0.203544,0.202206,0.194382,0.162738,0.159852,1.473028,2.898588,1.41093
min,0.0,0.0,,,,40.5757,-74.0873,1375.0,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,1.0,1.0,,,,40.7283,-73.9918,2500.0,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2.0


In [0]:
df.cov()

Unnamed: 0,bathrooms,bedrooms,latitude,longitude,price,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space
bathrooms,0.221569,0.272862,0.000246,-0.000273,570.1778,0.031236,0.00505,0.02279,0.005744,0.036572,0.039947,0.029393,-0.001389,0.030942,-0.002968,0.037522,0.01631,0.02348,0.033319,0.011829,0.01537,0.012182,0.006978,0.013908,-0.000186,0.001357,0.00881,0.005128,-0.000256
bedrooms,0.272862,1.223428,0.00015,0.001761,1043.91,-0.013711,-0.006173,0.053274,-0.005363,-0.026005,0.082898,0.086904,0.002116,0.006446,0.000611,0.063924,0.017059,0.04771,0.063637,0.019038,0.025777,0.008448,0.000223,0.02302,-0.00335,-0.024008,0.015214,0.002215,0.000572
latitude,0.000246,0.00015,0.001517,0.00037,-2.491125,-0.000205,-0.00058,0.000363,-0.0006,-0.000836,-0.000396,-0.000503,-0.000363,-0.001872,0.000406,-0.000658,-0.000835,-0.001075,0.000203,-0.00034,0.000157,0.00024,-0.000485,5.3e-05,-0.000476,-0.000123,-4e-06,-0.000433,-0.00073
longitude,-0.000273,0.001761,0.00037,0.000834,-12.77702,-0.002738,-0.000844,-0.001565,-0.00101,-0.003937,-0.002408,-0.001226,-0.000288,-0.003263,2e-06,-0.001431,-0.001582,-0.000983,-0.000218,-0.001024,-0.000244,-0.000495,-0.000692,-0.000297,0.000275,-0.000351,-0.000168,-0.000299,-0.000475
price,570.177795,1043.91035,-2.491125,-12.777017,3106162.0,182.337233,45.298706,89.362292,52.933939,240.642842,194.438671,112.339937,-7.651049,178.681771,-19.956922,181.97973,73.514832,86.50965,130.03712,44.885869,58.451417,54.143288,27.92824,54.001501,-4.75361,2.530231,35.51662,20.798759,3.24466
elevator,0.031236,-0.013711,-0.000205,-0.002738,182.3372,0.249388,0.008319,0.067562,0.00865,0.15171,0.086083,0.054857,0.015753,0.095399,-0.018838,0.025508,0.056263,0.035238,0.030427,0.039122,0.020007,0.020947,0.020404,0.014186,0.002632,0.005546,0.008159,0.012629,0.009171
cats_allowed,0.00505,-0.006173,-0.00058,-0.000844,45.29871,0.008319,0.249533,-0.044325,0.232807,0.021848,-0.011785,-0.005791,0.011799,0.028089,0.009226,-0.001679,0.004677,0.011508,-0.003399,0.010959,0.002379,0.001107,0.004733,0.000645,0.003194,-0.003835,0.00065,0.003224,0.008344
hardwood_floors,0.02279,0.053274,0.000363,-0.001565,89.36229,0.067562,-0.044325,0.249533,-0.046118,0.050651,0.156295,0.082581,-0.016164,0.037082,0.002609,0.068605,0.047245,0.032587,0.048141,0.033534,0.021205,0.019462,0.020793,0.018979,-0.01977,0.011775,0.015706,0.010148,-0.009621
dogs_allowed,0.005744,-0.005363,-0.0006,-0.00101,52.93394,0.00865,0.232807,-0.046118,0.247263,0.023458,-0.010741,-0.002736,0.010343,0.028982,0.009979,0.000475,0.005754,0.011652,-0.00244,0.012546,0.002901,0.001099,0.006139,0.000545,0.003313,-0.004124,0.000892,0.003894,0.008431
doorman,0.036572,-0.026005,-0.000836,-0.003937,240.6428,0.15171,0.021848,0.050651,0.023458,0.244358,0.076168,0.061243,0.008534,0.132504,-0.010497,0.031318,0.065235,0.035105,0.029401,0.043519,0.018945,0.029786,0.02412,0.013171,-0.007521,0.001008,0.007299,0.013528,0.010012


### Engineer at least two new features. (See above for explanation & ideas.)

In [0]:
# Create a feature for pets, pet_level, showing which pets are allowed in the apartment:
# 0 = no pets allowed
# 1 = only cats allowed
# 2 = only dogs allowed
# 3 = both cats and dogs allowed

def plvl(db):
    if db['cats_allowed'] == 0 and db['dogs_allowed'] == 0:
        val = 0
    elif db['cats_allowed'] == 1 and db['dogs_allowed'] == 0:
        val = 1
    elif db['cats_allowed'] == 0 and db['dogs_allowed'] == 1:
        val = 2
    elif db['cats_allowed'] == 1 and db['dogs_allowed'] == 1:
        val = 3
    return val

df['pet_level'] = df.apply(plvl, axis=1)

In [0]:
df['pet_level'].value_counts()

0    25433
3    21816
1     1532
2       36
Name: pet_level, dtype: int64

In [0]:
# Create a feature that scores amenities:
# add 1 to the overall score for each amenity the apartment has.
# The maximum possible score is 20 (one point per amenity listed below).

def amen_sco(db):
    amenities = ['elevator', 'hardwood_floors',
                 'doorman', 'dishwasher', 'laundry_in_building',
                 'fitness_center', 'pre-war', 'laundry_in_unit', 'roof_deck',
                 'outdoor_space', 'dining_room', 'high_speed_internet', 'balcony',
                 'swimming_pool', 'terrace', 'exclusive', 'loft',
                 'garden_patio', 'wheelchair_access', 'common_outdoor_space',]
    score = 0
    for amenity in amenities:
        if db[amenity] == 1:
            score = score + 1
    return score

df['amenity_score'] = df.apply(amen_sco, axis=1)

In [0]:
# creates a feature for total rooms in apartments
df['rooms'] = df['bathrooms'] + df['bedrooms']
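
The row-wise `apply` calls above can also be written as column arithmetic, which is usually faster on a DataFrame of this size. The following is a minimal sketch on made-up rows, assuming the flag columns only ever hold 0 or 1; `toy` and `toy_amenities` are hypothetical names, and only three of the 20 amenity columns are included to keep it short.

```python
import pandas as pd

# Toy rows (made up) covering all four cat/dog combinations
toy = pd.DataFrame({
    'cats_allowed': [0, 1, 0, 1],
    'dogs_allowed': [0, 0, 1, 1],
    'elevator':     [1, 0, 1, 1],
    'doorman':      [0, 0, 1, 1],
    'dishwasher':   [0, 1, 1, 1],
})

# Illustrative subset of the amenity columns used by amen_sco
toy_amenities = ['elevator', 'doorman', 'dishwasher']

# cats contribute 1, dogs contribute 2, so the result is
# 0 = none, 1 = cats only, 2 = dogs only, 3 = both
toy['pet_level'] = toy['cats_allowed'] + 2 * toy['dogs_allowed']

# Summing 0/1 flags across columns counts the amenities in each row
toy['amenity_score'] = toy[toy_amenities].sum(axis=1)

print(toy[['pet_level', 'amenity_score']])
```

With binary flags, this produces the same 0 to 3 encoding as `plvl` and the same count as `amen_sco`.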

In [0]:
# dum_mean.score(train, y_train)

### Do train/test split. Use data from April & May 2016 to train. Use data from June 2016 to test.

In [0]:
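start_date = '2016-04-01'
end_date = '2016-05-31'
# Note: '2016-05-31' is parsed as midnight (00:00:00), so listings created
# later on May 31 fall outside this mask. Using df['created'] < '2016-06-01'
# would include them.
traindate = (df['created'] >= start_date) & (df['created'] <= end_date)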
train = df.loc[traindate]
print(train.shape)
train.head()

(31839, 37)


Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space,pet_level,amenity_score,rooms
5186,1.0,1,2016-04-01 22:12:41,X-LARGE Flex 1BR Loft! ~~ PRIME Greenwich Vill...,Astor Place,40.7302,-73.9924,3195,1 Astor Place,high,1,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,2.0
7945,1.0,0,2016-04-01 22:56:00,"This Enormous Studio Features: Harwood Floors,...",East 54th Street,40.7576,-73.9677,2000,230 East 54th Street,medium,0,1,0,1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,1,1.0
6424,2.0,3,2016-04-01 22:57:15,--- East 31st St & Lexington Avenue --- This S...,East 31st St & Lexington Avenue,40.7388,-73.9851,5850,105 Lexington Avenue,high,1,1,0,1,1,0,1,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,3,6,5.0
7719,1.0,1,2016-04-01 23:26:07,Reduced Fee!! Priced To Rent!\rLarge Newly Upd...,West End Ave,40.7939,-73.9738,2745,700 West End Ave,medium,1,1,0,1,1,0,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,3,6,2.0
1723,1.0,1,2016-04-02 00:48:13,Phenomenal deal of the century!! This spacious...,E 88th street,40.7784,-73.9491,2400,401 E 88th street,medium,1,1,0,1,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,3,2.0


In [0]:
start_date = '2016-06-01'
end_date = '2016-06-30'
testdate = (df['created'] >= start_date) & (df['created'] <= end_date)
test = df.loc[testdate]
print(test.shape)
test.head()

(16973, 37)


Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space,pet_level,amenity_score,rooms
11474,1.0,1,2016-06-01 01:10:37,beautiful one bedroom in Williamsburg fit for 1,Montrose Avenue,40.7071,-73.9475,2150,61 Montrose Avenue,low,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,2.0
19176,1.0,1,2016-06-01 01:11:06,"Welcome home to charming, tree-lined Quincy St...",Quincy Street,40.6877,-73.945,2795,387 Quincy Street,low,0,1,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,1,2.0
16226,1.0,2,2016-06-01 01:11:12,A beautiful 2 bedroom Apartment in the East Vi...,Saint Marks Place,40.7291,-73.9895,3495,8 Saint Marks Place,low,0,1,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,1,3.0
37756,1.0,0,2016-06-01 01:11:52,"Nice studio, hardwood floors, exposed brick, d...",East 2nd Street,40.7213,-73.9822,2100,237 East 2nd Street,low,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1.0
17946,1.0,2,2016-06-01 01:12:22,Great bright and spacious 2 bedrooms two bathr...,East 118th Street,40.7986,-73.9372,2900,249 East 118th Street,low,0,1,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,1,3.0
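
One detail of the date masks is worth checking: a date string such as `'2016-05-31'` is compared as midnight at the start of that day, so `<=` leaves out listings created later on May 31. The recorded shapes are consistent with this, since 31839 + 16973 = 48812, five fewer than the 48817 rows in `df`. The sketch below uses made-up timestamps to show the difference between the inclusive mask and a half-open split on a single cutoff; `toy` and `cutoff` are hypothetical names.

```python
import pandas as pd

# Toy listings (made up), including two on May 31
toy = pd.DataFrame({
    'created': pd.to_datetime([
        '2016-04-15 10:00:00',
        '2016-05-31 00:00:00',
        '2016-05-31 18:30:00',
        '2016-06-01 01:10:37',
    ]),
    'price': [2500, 3000, 3200, 2150],
})

# Inclusive end date: the 18:30 listing on May 31 is excluded
inclusive_mask = toy['created'] <= '2016-05-31'

# Half-open split: everything before June 1 trains, the rest tests
cutoff = pd.Timestamp('2016-06-01')
half_open_train = toy[toy['created'] < cutoff]
half_open_test = toy[toy['created'] >= cutoff]

print(inclusive_mask.sum(), len(half_open_train), len(half_open_test))
```

A single cutoff also guarantees that every row lands in exactly one of the two sets.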


#### Baseline

In [0]:
target = 'price'
y_train = train[target]
y_test = test[target]

In [0]:
# Train Error
# Baseline: always predict the mean price of the training set
guess = y_train.mean()
from sklearn.metrics import mean_absolute_error
base_y_pred = [guess] * len(y_train)
mae = mean_absolute_error(y_train, base_y_pred)
print(f'Baseline currently off by ${mae:,.2f} for New York apartment rent.')

Baseline currently off by $1,201.97 for New York apartment rent.


In [0]:
# Test Error
# Reuse the training-set mean as the prediction for every test listing
base_y_pred = [guess] * len(y_test)
mae = mean_absolute_error(y_test, base_y_pred)
print(f'Baseline currently off by ${mae:,.2f} for New York apartment rent.')

Baseline currently off by $1,197.75 for New York apartment rent.


#### **Stretch: Baseline w/ DummyRegressor**

In [0]:
# I want to learn how to use DummyRegressor more; I'm not sure if this is the correct implementation.
# Note: DummyRegressor ignores the feature values, so only y_train affects its predictions.

# from sklearn.dummy import DummyRegressor
# target = 'price'
# y_train = train[target]
# y_test = test[target]

# dum_mean = DummyRegressor(strategy='mean')
# dum_mean.fit(train, y_train)
# dum_mean.predict(train)
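
One possible way to use `DummyRegressor` for a mean baseline is sketched below on made-up numbers. The arrays and the names `baseline`, `toy_pred`, and `toy_mae` are assumptions for illustration; the feature matrix is all zeros because `DummyRegressor` does not use feature values when predicting.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error

# Toy data (made up); X only needs the right number of rows
X_toy_train = np.zeros((4, 1))
y_toy_train = np.array([2000.0, 3000.0, 4000.0, 5000.0])
X_toy_test = np.zeros((2, 1))
y_toy_test = np.array([2500.0, 4500.0])

# strategy='mean' always predicts the mean of the training target
baseline = DummyRegressor(strategy='mean')
baseline.fit(X_toy_train, y_toy_train)

# Score the baseline on held-out targets
toy_pred = baseline.predict(X_toy_test)
toy_mae = mean_absolute_error(y_toy_test, toy_pred)

print(toy_pred, toy_mae)
```

This should match the manual baseline, which repeats `y_train.mean()` for every row, while exposing the same `fit`/`predict` interface as the other scikit-learn models.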

### Fit a linear regression model with at least two features.

In [0]:
features = ['rooms', 'amenity_score', 'pet_level']

print(f'Linear Regression on {features}')
X_train = train[features]
X_test = test[features]

print(X_train.shape, X_test.shape)

Linear Regression on ['rooms', 'amenity_score', 'pet_level']
(31839, 3) (16973, 3)


In [0]:
from sklearn.linear_model import LinearRegression

# init LinearRegression
model = LinearRegression()

model.fit(X_train, y_train)
y_pred = model.predict(X_train)
mae = mean_absolute_error(y_train, y_pred)
print(f'Multi-Regression: off by ${mae:,.2f} for New York apartment rent.')

Multi-Regression: off by $852.33 for New York apartment rent.


### Get the model's coefficients and intercept.

In [0]:
model.coef_, model.intercept_

(array([766.78668519, 123.41881553,  53.99178266]), 991.5602799984563)
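
The coefficients come back in the same order as `features`, so here they correspond to `rooms`, `amenity_score`, and `pet_level`: roughly $767 more rent per additional room, with the other two features held fixed. Pairing each coefficient with its feature name makes this easier to read. The sketch below does that on made-up data generated from a known linear rule; `toy_features`, `X_toy`, `y_toy`, and `toy_model` are hypothetical names.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy data (made up): price = 1000 + 700 * rooms + 100 * amenity_score
toy_features = ['rooms', 'amenity_score']
X_toy = pd.DataFrame({
    'rooms': [1.0, 2.0, 3.0, 4.0, 2.0],
    'amenity_score': [0, 3, 1, 5, 2],
})
y_toy = 1000 + 700 * X_toy['rooms'] + 100 * X_toy['amenity_score']

toy_model = LinearRegression().fit(X_toy, y_toy)

# Label each coefficient with the feature it belongs to
coefs = pd.Series(toy_model.coef_, index=toy_features)

print(f'Intercept: {toy_model.intercept_:,.2f}')
for name, value in coefs.items():
    print(f'{name}: {value:,.2f}')
```

Because the toy target is an exact linear function of the features, the fitted values should recover the intercept and slopes used to generate it.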

### Get regression metrics RMSE, MAE, and $R^2$, for both the train and test data.

### What's the best test MAE you can get? Share your score and features used with your cohort on Slack!

In [0]:
y_pred = model.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
print(f'Multi-Regression: off by ${mae:,.2f} for New York apartment rent.')

Multi-Regression: off by $862.67 for New York apartment rent.
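
The cells in this section report only MAE, while the assignment also asks for RMSE and $R^2$. A minimal sketch of computing all three with scikit-learn follows, using made-up values; `regression_metrics` is a hypothetical helper, and RMSE is taken as the square root of `mean_squared_error` rather than relying on any version-specific option.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def regression_metrics(y_true, y_pred):
    """Return RMSE, MAE, and R^2 for one set of predictions."""
    mse = mean_squared_error(y_true, y_pred)
    return {
        'RMSE': float(np.sqrt(mse)),
        'MAE': float(mean_absolute_error(y_true, y_pred)),
        'R2': float(r2_score(y_true, y_pred)),
    }

# Toy values (made up) standing in for y_train / model.predict(X_train)
y_true_toy = np.array([2000.0, 3000.0, 4000.0])
y_pred_toy = np.array([2100.0, 2900.0, 4200.0])

toy_metrics = regression_metrics(y_true_toy, y_pred_toy)
for name, value in toy_metrics.items():
    print(f'{name}: {value:,.3f}')
```

In the notebook, the same helper could be called once with `y_train` and the train predictions and once with `y_test` and the test predictions, so the two sets of metrics can be compared side by side.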
