Lambda School Data Science, Unit 2: Predictive Modeling

# Regression & Classification, Module 2

## Assignment

You'll continue to **predict how much it costs to rent an apartment in NYC,** using the dataset from renthop.com.

- [ ] Do train/test split. Use data from April & May 2016 to train. Use data from June 2016 to test.
- [ ] Engineer at least two new features. (See below for explanation & ideas.)
- [ ] Fit a linear regression model with at least two features.
- [ ] Get the model's coefficients and intercept.
- [ ] Get regression metrics RMSE, MAE, and $R^2$, for both the train and test data.
- [ ] What's the best test MAE you can get? Share your score and features used with your cohort on Slack!
- [ ] As always, commit your notebook to your fork of the GitHub repo.
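
For reference, the three metrics in the checklist can be computed directly from their definitions. The sketch below is a minimal illustration with NumPy on made-up prices; the function name `regression_metrics` and the sample values are assumptions for illustration, not part of the assignment.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return (rmse, mae, r2) computed from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    rmse = np.sqrt(np.mean(errors ** 2))            # root mean squared error
    mae = np.mean(np.abs(errors))                   # mean absolute error
    ss_res = np.sum(errors ** 2)                    # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    r2 = 1 - ss_res / ss_tot                        # share of variance explained
    return rmse, mae, r2

# Hypothetical rents and predictions, for illustration only
y_true = [3000, 2500, 4000, 3500]
y_pred = [3100, 2400, 3900, 3600]
rmse, mae, r2 = regression_metrics(y_true, y_pred)
print(f'RMSE: {rmse:.1f}, MAE: {mae:.1f}, R^2: {r2:.3f}')
```

The functions in `sklearn.metrics` (`mean_squared_error`, `mean_absolute_error`, `r2_score`) should give the same numbers; RMSE is the square root of the mean squared error.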


#### [Feature Engineering](https://en.wikipedia.org/wiki/Feature_engineering)

> "Some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used." — Pedro Domingos, ["A Few Useful Things to Know about Machine Learning"](https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf)

> "Coming up with features is difficult, time-consuming, requires expert knowledge. 'Applied machine learning' is basically feature engineering." — Andrew Ng, [Machine Learning and AI via Brain simulations](https://forum.stanford.edu/events/2011/2011slides/plenary/2011plenaryNg.pdf) 

> Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. 

#### Feature Ideas
- Does the apartment have a description?
- How long is the description?
- How many total perks does each apartment have?
- Are cats _or_ dogs allowed?
- Are cats _and_ dogs allowed?
- Total number of rooms (beds + baths)
- Ratio of beds to baths
- What's the neighborhood, based on address or latitude & longitude?
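
Several of these ideas reduce to one-line pandas expressions. The sketch below uses a small made-up DataFrame that mimics a few renthop columns; the new column names (`has_description`, `perk_count`, and so on) and the short `perk_cols` list are assumptions for illustration.

```python
import pandas as pd

# Hypothetical listings that mimic a few renthop columns
listings = pd.DataFrame({
    'description': ['Sunny 2BR near park', None, '   '],
    'bedrooms': [2, 1, 3],
    'bathrooms': [1.0, 1.0, 2.0],
    'cats_allowed': [1, 0, 1],
    'dogs_allowed': [0, 0, 1],
    'doorman': [1, 0, 1],
    'elevator': [1, 0, 0],
})

# Treat missing or whitespace-only descriptions as empty
desc = listings['description'].fillna('').str.strip()
listings['has_description'] = (desc != '').astype(int)
listings['description_length'] = desc.str.len()

# Total perks: sum of the binary amenity columns (only two are listed here)
perk_cols = ['doorman', 'elevator']
listings['perk_count'] = listings[perk_cols].sum(axis=1)

# Pets: either one allowed vs. both allowed
listings['cats_or_dogs'] = listings['cats_allowed'] | listings['dogs_allowed']
listings['cats_and_dogs'] = listings['cats_allowed'] & listings['dogs_allowed']

# Total number of rooms (beds + baths)
listings['total_rooms'] = listings['bedrooms'] + listings['bathrooms']

print(listings.drop(columns='description'))
```

Whatever features you build, apply the same transformation to both the train and test sets so the model sees identical columns.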

## Stretch Goals
- [ ] If you want more math, skim [_An Introduction to Statistical Learning_](http://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf), Chapter 3.1, Simple Linear Regression, & Chapter 3.2, Multiple Linear Regression
- [ ] If you want more introduction, watch [Brandon Foltz, Statistics 101: Simple Linear Regression](https://www.youtube.com/watch?v=ZkjP5RJLQF4) (20 minutes, over 1 million views)
- [ ] Add your own stretch goal(s)!

In [1]:
import os, sys
in_colab = 'google.colab' in sys.modules

# If you're in Colab...
if in_colab:
    # Pull files from Github repo
    os.chdir('/content')
    !git init .
    !git remote add origin https://github.com/LambdaSchool/DS-Unit-2-Regression-Classification.git
    !git pull origin master
    
    # Install required python packages
    !pip install -r requirements.txt
    
    # Change into directory for module
    os.chdir('module2')

Initialized empty Git repository in /content/.git/
remote: Enumerating objects: 156, done.
remote: Total 156 (delta 0), reused 0 (delta 0), pack-reused 156
Receiving objects: 100% (156/156), 19.30 MiB | 14.15 MiB/s, done.
Resolving deltas: 100% (71/71), done.
From https://github.com/LambdaSchool/DS-Unit-2-Regression-Classification
 * branch            master     -> FETCH_HEAD
 * [new branch]      master     -> origin/master
Collecting category_encoders==2.0.0 (from -r requirements.txt (line 1))
  Downloading https://files.pythonhosted.org/packages/6e/a1/f7a22f144f33be78afeb06bfa78478e8284a64263a3c09b1ef54e673841e/category_encoders-2.0.0-py2.py3-none-any.whl (87kB)
Collecting eli5==0.10.0 (from -r requirements.txt (line 2))
  Downloading https://files.pythonhosted.org/packages/e6/ea/47bd5844bb609d45821114aa7e0bc9e4422053fe24a6cf6b357f0d3f74d3/eli5-0.10.0-py2.py3-none-any.whl (105kB)

In [0]:
# Ignore this Numpy warning when using Plotly Express:
# FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning, module='numpy')

In [0]:
import numpy as np
import pandas as pd

# Read New York City apartment rental listing data
df = pd.read_csv('../data/apartments/renthop-nyc.csv')
assert df.shape == (49352, 34)

# Remove the most extreme 1% prices,
# the most extreme .1% latitudes, &
# the most extreme .1% longitudes
df = df[(df['price'] >= np.percentile(df['price'], 0.5)) & 
        (df['price'] <= np.percentile(df['price'], 99.5)) & 
        (df['latitude'] >= np.percentile(df['latitude'], 0.05)) & 
        (df['latitude'] < np.percentile(df['latitude'], 99.95)) &
        (df['longitude'] >= np.percentile(df['longitude'], 0.05)) & 
        (df['longitude'] <= np.percentile(df['longitude'], 99.95))]

In [4]:
#Do train/test split. Use data from April & May 2016 to train. Use data from June 2016 to test.

df.head()

Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space
0,1.5,3,2016-06-24 07:54:24,A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ...,Metropolitan Avenue,40.7145,-73.9425,3000,792 Metropolitan Avenue,medium,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,1.0,2,2016-06-12 12:19:27,,Columbus Avenue,40.7947,-73.9667,5465,808 Columbus Avenue,low,1,1,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,1.0,1,2016-04-17 03:26:41,"Top Top West Village location, beautiful Pre-w...",W 13 Street,40.7388,-74.0018,2850,241 W 13 Street,high,0,0,1,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,1.0,1,2016-04-18 02:22:02,Building Amenities - Garage - Garden - fitness...,East 49th Street,40.7539,-73.9677,3275,333 East 49th Street,low,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,1.0,4,2016-04-28 01:32:41,Beautifully renovated 3 bedroom flex 4 bedroom...,West 143rd Street,40.8241,-73.9493,3350,500 West 143rd Street,low,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [8]:
# 'created' is well-formatted, but it is stored as strings rather than datetimes
df['created'] = pd.to_datetime(df['created'])

df.head()

Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space
0,1.5,3,2016-06-24 07:54:24,A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ...,Metropolitan Avenue,40.7145,-73.9425,3000,792 Metropolitan Avenue,medium,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,1.0,2,2016-06-12 12:19:27,,Columbus Avenue,40.7947,-73.9667,5465,808 Columbus Avenue,low,1,1,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,1.0,1,2016-04-17 03:26:41,"Top Top West Village location, beautiful Pre-w...",W 13 Street,40.7388,-74.0018,2850,241 W 13 Street,high,0,0,1,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,1.0,1,2016-04-18 02:22:02,Building Amenities - Garage - Garden - fitness...,East 49th Street,40.7539,-73.9677,3275,333 East 49th Street,low,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,1.0,4,2016-04-28 01:32:41,Beautifully renovated 3 bedroom flex 4 bedroom...,West 143rd Street,40.8241,-73.9493,3350,500 West 143rd Street,low,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [10]:
# Create a month column for the train/test split
df['month'] = df['created'].dt.month

df.head()

Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space,month
0,1.5,3,2016-06-24 07:54:24,A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ...,Metropolitan Avenue,40.7145,-73.9425,3000,792 Metropolitan Avenue,medium,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,6
1,1.0,2,2016-06-12 12:19:27,,Columbus Avenue,40.7947,-73.9667,5465,808 Columbus Avenue,low,1,1,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,6
2,1.0,1,2016-04-17 03:26:41,"Top Top West Village location, beautiful Pre-w...",W 13 Street,40.7388,-74.0018,2850,241 W 13 Street,high,0,0,1,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4
3,1.0,1,2016-04-18 02:22:02,Building Amenities - Garage - Garden - fitness...,East 49th Street,40.7539,-73.9677,3275,333 East 49th Street,low,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4
4,1.0,4,2016-04-28 01:32:41,Beautifully renovated 3 bedroom flex 4 bedroom...,West 143rd Street,40.8241,-73.9493,3350,500 West 143rd Street,low,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4


In [13]:
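# Split on the month value: April & May for training, June for testing.
# .copy() makes each split an independent DataFrame, so new columns can be added safely.
train = df[df['month'] < 6].copy()
test = df[df['month'] == 6].copy()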

print(train.shape)
train.head()

Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space,month
2,1.0,1,2016-04-17 03:26:41,"Top Top West Village location, beautiful Pre-w...",W 13 Street,40.7388,-74.0018,2850,241 W 13 Street,high,0,0,1,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4
3,1.0,1,2016-04-18 02:22:02,Building Amenities - Garage - Garden - fitness...,East 49th Street,40.7539,-73.9677,3275,333 East 49th Street,low,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4
4,1.0,4,2016-04-28 01:32:41,Beautifully renovated 3 bedroom flex 4 bedroom...,West 143rd Street,40.8241,-73.9493,3350,500 West 143rd Street,low,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4
5,2.0,4,2016-04-19 04:24:47,,West 18th Street,40.7429,-74.0028,7995,350 West 18th Street,medium,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4
6,1.0,2,2016-04-27 03:19:56,Stunning unit with a great location and lots o...,West 107th Street,40.8012,-73.966,3600,210 West 107th Street,low,0,1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4


In [14]:
test.head()

Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space,month
0,1.5,3,2016-06-24 07:54:24,A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ...,Metropolitan Avenue,40.7145,-73.9425,3000,792 Metropolitan Avenue,medium,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,6
1,1.0,2,2016-06-12 12:19:27,,Columbus Avenue,40.7947,-73.9667,5465,808 Columbus Avenue,low,1,1,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,6
11,1.0,1,2016-06-03 03:21:22,Check out this one bedroom apartment in a grea...,W. 173rd Street,40.8448,-73.9396,1675,644 W. 173rd Street,low,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,6
14,1.0,1,2016-06-01 03:11:01,Spacious 1-Bedroom to fit King-sized bed comfo...,East 56th St..,40.7584,-73.9648,3050,315 East 56th St..,low,1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,6
24,2.0,4,2016-06-07 04:39:56,SPRAWLING 2 BEDROOM FOUND! ENJOY THE LUXURY OF...,W 18 St.,40.7391,-73.9936,7400,30 W 18 St.,medium,1,1,1,1,1,1,0,0,1,0,0,0,1,0,1,1,0,0,1,0,0,0,0,0,6


In [25]:
# Fit a linear regression model with at least two features.
# Get regression metrics RMSE, MAE, and R^2, for both the train and test data.
# Get the model's coefficients and intercept.
# Start with two non-engineered features.

# 1. Import the appropriate estimator class from Scikit-Learn
from sklearn.linear_model import LinearRegression

from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import r2_score

# 2. Instantiate this class
model1 = LinearRegression()

# 3. Arrange X features matrix & y target vector
features = ['bathrooms','bedrooms']
target = 'price'
X_train = train[features]
y_train = train[target]

# 4. Fit the model
model1.fit(X_train, y_train)
y_pred = model1.predict(X_train)
print(f'Model coefficient for bathrooms is {model1.coef_[0]}')
print(f'Model coefficient for bedrooms is {model1.coef_[1]}')
print(f'Model intercept is {model1.intercept_}')
print('\n')

mae = mean_absolute_error(y_train, y_pred)
rmse = np.sqrt(mean_squared_error(y_train, y_pred))
r2 = r2_score(y_train, y_pred)
print(f'Train Root Mean Squared Error: {rmse}')
print(f'Train Mean Absolute Error: {mae}')
print(f'Train R2 score: {r2}')

print('\n')

# 5. Apply the model to new data
X_test = test[features]
y_test = test[target]
y_pred = model1.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print(f'Test Root Mean Squared Error: {rmse}')
print(f'Test Mean Absolute Error: {mae}')
print(f'Test R2 score: {r2}')

Model coefficient for bathrooms is 2072.610116385187
Model coefficient for bedrooms is 389.32489590255614
Model intercept is 485.71869002322865


Train Root Mean Squared Error: 1232.0225917223484
Train Mean Absolute Error: 818.5310213271714
Train R2 score: 0.5111543084316607


Test Root Mean Squared Error: 1219.719357233823
Test Mean Absolute Error: 825.8987822403527
Test R2 score: 0.5213303957090345
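
To read the fitted model, a prediction is the intercept plus each coefficient times its feature value. The sketch below plugs in rounded values of the coefficients printed above for two hypothetical listings; `predict_rent` is an illustrative helper, not part of the notebook.

```python
# Rounded values copied from the model1 output
intercept = 485.72
coef_bathrooms = 2072.61
coef_bedrooms = 389.32

def predict_rent(bathrooms, bedrooms):
    """Linear model prediction: intercept + sum of (coefficient * feature)."""
    return intercept + coef_bathrooms * bathrooms + coef_bedrooms * bedrooms

one_bed = predict_rent(bathrooms=1, bedrooms=1)
two_bed = predict_rent(bathrooms=1, bedrooms=2)
print(f'1 bed / 1 bath: ${one_bed:,.2f}')
print(f'2 bed / 1 bath: ${two_bed:,.2f}')
```

Under this model, each additional bedroom adds about $389 to the predicted rent and each additional bathroom about $2,073, holding the other feature fixed.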


In [29]:
#Engineer at least two new features.
#new feature one: "has at least one of the features that makes me say 'ooh, fancy'"

test['fancy'] = test['doorman'] | test['high_speed_internet'] | test['exclusive'] | test['swimming_pool']

test.head()



Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space,month,fancy
0,1.5,3,2016-06-24 07:54:24,A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ...,Metropolitan Avenue,40.7145,-73.9425,3000,792 Metropolitan Avenue,medium,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,6,0
1,1.0,2,2016-06-12 12:19:27,,Columbus Avenue,40.7947,-73.9667,5465,808 Columbus Avenue,low,1,1,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,6,1
11,1.0,1,2016-06-03 03:21:22,Check out this one bedroom apartment in a grea...,W. 173rd Street,40.8448,-73.9396,1675,644 W. 173rd Street,low,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,6,0
14,1.0,1,2016-06-01 03:11:01,Spacious 1-Bedroom to fit King-sized bed comfo...,East 56th St..,40.7584,-73.9648,3050,315 East 56th St..,low,1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,6,0
24,2.0,4,2016-06-07 04:39:56,SPRAWLING 2 BEDROOM FOUND! ENJOY THE LUXURY OF...,W 18 St.,40.7391,-73.9936,7400,30 W 18 St.,medium,1,1,1,1,1,1,0,0,1,0,0,0,1,0,1,1,0,0,1,0,0,0,0,0,6,1


In [30]:
#also add fancy to the training data
train['fancy'] = train['doorman'] | train['high_speed_internet'] | train['exclusive'] | train['swimming_pool']

train.head()



Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space,month,fancy
2,1.0,1,2016-04-17 03:26:41,"Top Top West Village location, beautiful Pre-w...",W 13 Street,40.7388,-74.0018,2850,241 W 13 Street,high,0,0,1,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4,0
3,1.0,1,2016-04-18 02:22:02,Building Amenities - Garage - Garden - fitness...,East 49th Street,40.7539,-73.9677,3275,333 East 49th Street,low,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4,0
4,1.0,4,2016-04-28 01:32:41,Beautifully renovated 3 bedroom flex 4 bedroom...,West 143rd Street,40.8241,-73.9493,3350,500 West 143rd Street,low,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4,0
5,2.0,4,2016-04-19 04:24:47,,West 18th Street,40.7429,-74.0028,7995,350 West 18th Street,medium,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4,0
6,1.0,2,2016-04-27 03:19:56,Stunning unit with a great location and lots o...,West 107th Street,40.8012,-73.966,3600,210 West 107th Street,low,0,1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4,0


In [35]:
#let's incorporate fancy into the model

# 2. Instantiate this class
model2 = LinearRegression()

# 3. Arrange X features matrix & y target vector
features = ['bathrooms','bedrooms','fancy']
target = 'price'
X_train = train[features]
y_train = train[target]

# 4. Fit the model
model2.fit(X_train, y_train)
y_pred = model2.predict(X_train)
print(f'Model coefficient for bathrooms is {model2.coef_[0]}')
print(f'Model coefficient for bedrooms is {model2.coef_[1]}')
print(f'Model coefficient for fanciness is {model2.coef_[2]}')
print(f'Model intercept is {model2.intercept_}')
print('\n')

mae = mean_absolute_error(y_train, y_pred)
rmse = np.sqrt(mean_squared_error(y_train, y_pred))
r2 = r2_score(y_train, y_pred)
print(f'Train Root Mean Squared Error: {rmse}')
print(f'Train Mean Absolute Error: {mae}')
print(f'Train R2 score: {r2}')

print('\n')

# 5. Apply the model to new data
X_test = test[features]
y_test = test[target]
y_pred = model2.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print(f'Test Root Mean Squared Error: {rmse}')
print(f'Test Mean Absolute Error: {mae}')
print(f'Test R2 score: {r2}')
# Adding 'fancy' helps: test MAE drops from about 826 to about 783,
# and test R^2 rises from about 0.52 to about 0.56.

Model coefficient for bathrooms is 1905.3166883476335
Model coefficient for bedrooms is 439.8102868522924
Model coefficient for fanciness is 673.2598201786482
Model intercept is 293.74952578246393


Train Root Mean Squared Error: 1187.306841038542
Train Mean Absolute Error: 775.7081671290282
Train R2 score: 0.5459952580610337


Test Root Mean Squared Error: 1174.845294710409
Test Mean Absolute Error: 783.1127323609801
Test R2 score: 0.5559034679782427


In [32]:
#new feature: laundry, which is 1 if there's laundry in the unit or the building
train['laundry'] = train['laundry_in_building'] | train['laundry_in_unit']

train.head()



Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space,month,fancy,laundry
2,1.0,1,2016-04-17 03:26:41,"Top Top West Village location, beautiful Pre-w...",W 13 Street,40.7388,-74.0018,2850,241 W 13 Street,high,0,0,1,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4,0,1
3,1.0,1,2016-04-18 02:22:02,Building Amenities - Garage - Garden - fitness...,East 49th Street,40.7539,-73.9677,3275,333 East 49th Street,low,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4,0,0
4,1.0,4,2016-04-28 01:32:41,Beautifully renovated 3 bedroom flex 4 bedroom...,West 143rd Street,40.8241,-73.9493,3350,500 West 143rd Street,low,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4,0,0
5,2.0,4,2016-04-19 04:24:47,,West 18th Street,40.7429,-74.0028,7995,350 West 18th Street,medium,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4,0,0
6,1.0,2,2016-04-27 03:19:56,Stunning unit with a great location and lots o...,West 107th Street,40.8012,-73.966,3600,210 West 107th Street,low,0,1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4,0,0


In [33]:
test['laundry'] = test['laundry_in_building'] | test['laundry_in_unit']

test.head()



Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space,month,fancy,laundry
0,1.5,3,2016-06-24 07:54:24,A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ...,Metropolitan Avenue,40.7145,-73.9425,3000,792 Metropolitan Avenue,medium,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,6,0,0
1,1.0,2,2016-06-12 12:19:27,,Columbus Avenue,40.7947,-73.9667,5465,808 Columbus Avenue,low,1,1,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,6,1,0
11,1.0,1,2016-06-03 03:21:22,Check out this one bedroom apartment in a grea...,W. 173rd Street,40.8448,-73.9396,1675,644 W. 173rd Street,low,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,6,0,0
14,1.0,1,2016-06-01 03:11:01,Spacious 1-Bedroom to fit King-sized bed comfo...,East 56th St..,40.7584,-73.9648,3050,315 East 56th St..,low,1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,6,0,0
24,2.0,4,2016-06-07 04:39:56,SPRAWLING 2 BEDROOM FOUND! ENJOY THE LUXURY OF...,W 18 St.,40.7391,-73.9936,7400,30 W 18 St.,medium,1,1,1,1,1,1,0,0,1,0,0,0,1,0,1,1,0,0,1,0,0,0,0,0,6,1,0


In [37]:
#incorporate laundry into the model

# 2. Instantiate this class
model3 = LinearRegression()

# 3. Arrange X features matrix & y target vector
features = ['bathrooms','bedrooms','fancy','laundry']
target = 'price'
X_train = train[features]
y_train = train[target]

# 4. Fit the model
model3.fit(X_train, y_train)
y_pred = model3.predict(X_train)
print(f'Model coefficient for bathrooms is {model3.coef_[0]}')
print(f'Model coefficient for bedrooms is {model3.coef_[1]}')
print(f'Model coefficient for fanciness is {model3.coef_[2]}')
print(f'Model coefficient for laundry is {model3.coef_[3]}')
print(f'Model intercept is {model3.intercept_}')
print('\n')

mae = mean_absolute_error(y_train, y_pred)
rmse = np.sqrt(mean_squared_error(y_train, y_pred))
r2 = r2_score(y_train, y_pred)
print(f'Train Root Mean Squared Error: {rmse}')
print(f'Train Mean Absolute Error: {mae}')
print(f'Train R2 score: {r2}')

print('\n')

# 5. Apply the model to new data
X_test = test[features]
y_test = test[target]
y_pred = model3.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print(f'Test Root Mean Squared Error: {rmse}')
print(f'Test Mean Absolute Error: {mae}')
print(f'Test R2 score: {r2}')
# 'laundry' adds little: test MAE only improves from about 783 to about 780.

Model coefficient for bathrooms is 1872.139940474443
Model coefficient for bedrooms is 429.0973198885481
Model coefficient for fanciness is 628.3757686540985
Model coefficient for laundry is 332.0910884897759
Model intercept is 294.3928674045296


Train Root Mean Squared Error: 1179.519101424111
Train Mean Absolute Error: 770.6845706760241
Train R2 score: 0.551931508187964


Test Root Mean Squared Error: 1171.6957668176913
Test Mean Absolute Error: 779.584670226292
Test R2 score: 0.5582813463412813
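
The three model cells above repeat the same fit-and-report steps with a different feature list each time. One way to factor that out is sketched below; `fit_and_score` is a hypothetical helper, and the demo data is synthetic (an exact linear relationship), so the numbers it prints are not renthop results.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def fit_and_score(train, test, features, target='price'):
    """Fit LinearRegression on train; return the model and a metrics table."""
    model = LinearRegression()
    model.fit(train[features], train[target])
    rows = {}
    for name, frame in [('train', train), ('test', test)]:
        y_true = frame[target]
        y_pred = model.predict(frame[features])
        rows[name] = {
            'rmse': np.sqrt(mean_squared_error(y_true, y_pred)),
            'mae': mean_absolute_error(y_true, y_pred),
            'r2': r2_score(y_true, y_pred),
        }
    # One row per data split, one column per metric
    return model, pd.DataFrame(rows).T

# Synthetic data for illustration: price = 500 + 2000*bathrooms + 400*bedrooms
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'bathrooms': rng.integers(1, 4, size=200),
    'bedrooms': rng.integers(0, 5, size=200),
})
demo['price'] = 500 + 2000 * demo['bathrooms'] + 400 * demo['bedrooms']
demo_train, demo_test = demo.iloc[:150], demo.iloc[150:]

model, scores = fit_and_score(demo_train, demo_test, ['bathrooms', 'bedrooms'])
print(dict(zip(['bathrooms', 'bedrooms'], model.coef_)), model.intercept_)
print(scores)
```

In the notebook, the same call would take the real `train` and `test` frames and whichever feature list is being compared, for example `['bathrooms', 'bedrooms', 'fancy']`.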


In [39]:
!pip install sodapy

from sodapy import Socrata

# Unauthenticated client only works with public data sets. Note 'None'
# in place of application token, and no username or password:
client = Socrata("data.ny.gov", None)

# Example authenticated client (needed for non-public datasets):
# client = Socrata("data.ny.gov",
#                  MyAppToken,
#                  username="user@example.com",
#                  password="AFakePassword")

# First 2000 results, returned as JSON from API / converted to Python list of
# dictionaries by sodapy.
results = client.get("i9wp-a4ja", limit=2000)

# Convert to pandas DataFrame
results_df = pd.DataFrame.from_records(results)


Collecting sodapy
  Downloading https://files.pythonhosted.org/packages/3d/de/963fdd76b608c35c7e183c8fa299896fa779db71b7220f0d5c936294fde8/sodapy-1.5.5-py2.py3-none-any.whl
Collecting future>=0.17.1 (from sodapy)
  Downloading https://files.pythonhosted.org/packages/90/52/e20466b85000a181e1e144fd8305caf2cf475e2f9674e797b222f8105f5f/future-0.17.1.tar.gz (829kB)
Building wheels for collected packages: future
  Building wheel for future (setup.py) ... done
  Created wheel for future: filename=future-0.17.1-cp36-none-any.whl size=488730 sha256=aaea4955ffff837408a68dd4a61ae1fb740c47fe5901cacb33444962380cc642
  Stored in directory: /root/.cache/pip/wheels/0c/61/d2/d6b7317325828fbb39ee6ad559dbe4664d0896da4721bf379e
Successfully built future
Installing collected packages: future, sodapy
  Found existing installation: future 0.16.0
    Uninstalling future-0.16.0:
      Successfully uninstalled future-0.16.0
Successfull



In [42]:
print(results_df.shape)
results_df.head(10)

(1868, 32)


Unnamed: 0,ada,ada_notes,corner,division,east_west_street,entrance_latitude,entrance_location,entrance_longitude,entrance_type,entry,exit_only,free_crossover,line,north_south_street,route1,route10,route11,route2,route3,route4,route5,route6,route7,route8,route9,staff_hours,staffing,station_latitude,station_location,station_longitude,station_name,vending
0,False,,SW,BMT,25th St,40.660489,"{'latitude': '40.660489', 'longitude': '-73.99...",-73.99822,Stair,YES,,False,4 Avenue,4th Ave,R,,,,,,,,,,,,NONE,40.660397,"{'latitude': '40.660397', 'longitude': '-73.99...",-73.998091,25th St,YES
1,False,,SE,BMT,25th St,40.660323,"{'latitude': '40.660323', 'longitude': '-73.99...",-73.997952,Stair,YES,,False,4 Avenue,4th Ave,R,,,,,,,,,,,,FULL,40.660397,"{'latitude': '40.660397', 'longitude': '-73.99...",-73.998091,25th St,YES
2,False,,NW,BMT,36th St,40.654676,"{'latitude': '40.654676', 'longitude': '-74.00...",-74.004306,Stair,YES,,True,4 Avenue,4th Ave,N,,,R,,,,,,,,,FULL,40.655144,"{'latitude': '40.655144', 'longitude': '-74.00...",-74.003549,36th St,YES
3,False,,NE,BMT,36th St,40.654365,"{'latitude': '40.654365', 'longitude': '-74.00...",-74.004113,Stair,YES,,True,4 Avenue,4th Ave,N,,,R,,,,,,,,,FULL,40.655144,"{'latitude': '40.655144', 'longitude': '-74.00...",-74.003549,36th St,YES
4,False,,NW,BMT,36th St,40.65449,"{'latitude': '40.65449', 'longitude': '-74.004...",-74.004499,Stair,YES,,True,4 Avenue,4th Ave,N,,,R,,,,,,,,,FULL,40.655144,"{'latitude': '40.655144', 'longitude': '-74.00...",-74.003549,36th St,YES
5,False,,NE,BMT,45th St,40.649389,"{'latitude': '40.649389', 'longitude': '-74.00...",-74.009333,Stair,YES,,True,4 Avenue,4th Ave,R,,,,,,,,,,,,FULL,40.648939,"{'latitude': '40.648939', 'longitude': '-74.01...",-74.010006,45th St,YES
6,False,,NW,BMT,45th St,40.649424,"{'latitude': '40.649424', 'longitude': '-74.00...",-74.009728,Stair,YES,,True,4 Avenue,4th Ave,R,,,,,,,,,,,,FULL,40.648939,"{'latitude': '40.648939', 'longitude': '-74.01...",-74.010006,45th St,YES
7,False,,NE,BMT,45th St,40.649265,"{'latitude': '40.649265', 'longitude': '-74.00...",-74.009457,Stair,YES,,True,4 Avenue,4th Ave,R,,,,,,,,,,,,FULL,40.648939,"{'latitude': '40.648939', 'longitude': '-74.01...",-74.010006,45th St,YES
8,False,,NW,BMT,45th St,40.649546,"{'latitude': '40.649546', 'longitude': '-74.00...",-74.009602,Stair,YES,,True,4 Avenue,4th Ave,R,,,,,,,,,,,,FULL,40.648939,"{'latitude': '40.648939', 'longitude': '-74.01...",-74.010006,45th St,YES
9,False,,SW,BMT,53rd St,40.644756,"{'latitude': '40.644756', 'longitude': '-74.01...",-74.014586,Stair,YES,,True,4 Avenue,4th Ave,R,,,,,,,,,,,,FULL,40.645069,"{'latitude': '40.645069', 'longitude': '-74.01...",-74.014034,53rd St,YES


In [44]:
results_df.tail(10)

Unnamed: 0,ada,ada_notes,corner,division,east_west_street,entrance_latitude,entrance_location,entrance_longitude,entrance_type,entry,exit_only,free_crossover,line,north_south_street,route1,route10,route11,route2,route3,route4,route5,route6,route7,route8,route9,staff_hours,staffing,station_latitude,station_location,station_longitude,station_name,vending
1858,False,,NE,IRT,Longwood Ave,40.819562,"{'latitude': '40.819562', 'longitude': '-73.90...",-73.901498,Stair,YES,,False,White Plains Road,Prospect Ave,2,,,5,,,,,,,,,NONE,40.819585,"{'latitude': '40.819585', 'longitude': '-73.90...",-73.90177,Prospect Av,YES
1859,False,,NE,IRT,Longwood Ave,40.819684,"{'latitude': '40.819684', 'longitude': '-73.90...",-73.901348,Stair,YES,,False,White Plains Road,Prospect Ave,2,,,5,,,,,,,,,NONE,40.819585,"{'latitude': '40.819585', 'longitude': '-73.90...",-73.90177,Prospect Av,YES
1860,True,,NW,IRT,Simpson St,40.824177,"{'latitude': '40.824177', 'longitude': '-73.89...",-73.893114,Stair,YES,,False,White Plains Road,Westchester Ave,2,,,5,,,,,,,,,FULL,40.824073,"{'latitude': '40.824073', 'longitude': '-73.89...",-73.893064,Simpson St,YES
1861,True,,NE,IRT,Simpson St,40.824356,"{'latitude': '40.824356', 'longitude': '-73.89...",-73.892897,Elevator,YES,,False,White Plains Road,Westchester Ave,2,,,5,,,,,,,,,FULL,40.824073,"{'latitude': '40.824073', 'longitude': '-73.89...",-73.893064,Simpson St,YES
1862,True,,NE,IRT,Simpson St,40.824358,"{'latitude': '40.824358', 'longitude': '-73.89...",-73.892724,Stair,YES,,False,White Plains Road,Westchester Ave,2,,,5,,,,,,,,,FULL,40.824073,"{'latitude': '40.824073', 'longitude': '-73.89...",-73.893064,Simpson St,YES
1863,True,,SW,IRT,Simpson St,40.823884,"{'latitude': '40.823884', 'longitude': '-73.89...",-73.893197,Elevator,YES,,False,White Plains Road,Westchester Ave,2,,,5,,,,,,,,,NONE,40.824073,"{'latitude': '40.824073', 'longitude': '-73.89...",-73.893064,Simpson St,YES
1864,True,,SE,IRT,Simpson St,40.824118,"{'latitude': '40.824118', 'longitude': '-73.89...",-73.892731,Stair,YES,,False,White Plains Road,Westchester Ave,2,,,5,,,,,,,,,NONE,40.824073,"{'latitude': '40.824073', 'longitude': '-73.89...",-73.893064,Simpson St,YES
1865,False,,SE,IRT,241st St,40.903283,"{'latitude': '40.903283', 'longitude': '-73.85...",-73.850321,Stair,YES,,True,White Plains Road,White Plains Rd,2,,,5,,,,,,,,,FULL,40.903125,"{'latitude': '40.903125', 'longitude': '-73.85...",-73.85062,Wakefield-241st St,YES
1866,False,,NW,IRT,241st St,40.903597,"{'latitude': '40.903597', 'longitude': '-73.85...",-73.85036,Stair,YES,,True,White Plains Road,White Plains Rd,2,,,5,,,,,,,,,FULL,40.903125,"{'latitude': '40.903125', 'longitude': '-73.85...",-73.85062,Wakefield-241st St,YES
1867,False,,NE,IRT,241st St,40.903509,"{'latitude': '40.903509', 'longitude': '-73.85...",-73.850124,Stair,YES,,True,White Plains Road,White Plains Rd,2,,,5,,,,,,,,,FULL,40.903125,"{'latitude': '40.903125', 'longitude': '-73.85...",-73.85062,Wakefield-241st St,YES


In [49]:
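# Entrance longitude of the last row, looked up by column name instead of position
float(results_df.loc[1867, 'entrance_longitude'])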

-73.850124

In [0]:
# Minimum gridwise (Manhattan) distance, in degrees, to a subway station entrance

def minsubdist(lat, long):
    mindist = float('inf')
    # Check every entrance row, from the first through the last
    for i in range(len(results_df)):
        row = results_df.iloc[i]
        dist = (abs(lat - float(row['entrance_latitude'])) +
                abs(long - float(row['entrance_longitude'])))
        if dist < mindist:
            mindist = dist
    return mindist

In [51]:
minsubdist(40.7388,-74.0018)

0.001338999999994428

In [54]:
#train['subway-dist'] = train.apply(lambda row: minsubdist(row.latitude,row.longitude),axis=1)

#train.head()

# Runs forever. I've done some work to make the code more efficient, but it hasn't borne fruit yet.

KeyboardInterrupt: ignored
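
The row-by-row `apply` is slow because every listing triggers a Python loop over all of the entrances. One possible speedup, sketched below under the assumption that the coordinates fit in NumPy float arrays, is to compute the distances with broadcasting, a chunk of listings at a time. The name `min_grid_distance`, the `chunk_size` default, and the two-entrance demo arrays are illustrative choices, not part of the notebook.

```python
import numpy as np

def min_grid_distance(apt_lat, apt_lon, ent_lat, ent_lon, chunk_size=2000):
    """For each apartment, the smallest |dlat| + |dlon| to any entrance (degrees)."""
    apt_lat = np.asarray(apt_lat, dtype=float)
    apt_lon = np.asarray(apt_lon, dtype=float)
    ent_lat = np.asarray(ent_lat, dtype=float)
    ent_lon = np.asarray(ent_lon, dtype=float)
    out = np.empty(len(apt_lat))
    # Work in chunks so the (apartments x entrances) matrix stays small in memory
    for start in range(0, len(apt_lat), chunk_size):
        stop = start + chunk_size
        dlat = np.abs(apt_lat[start:stop, None] - ent_lat[None, :])
        dlon = np.abs(apt_lon[start:stop, None] - ent_lon[None, :])
        out[start:stop] = (dlat + dlon).min(axis=1)
    return out

# Tiny illustration with two entrances taken from the first and last table rows
ent_lat = [40.660489, 40.903509]
ent_lon = [-73.998220, -73.850124]
dists = min_grid_distance([40.7388, 40.9000], [-74.0018, -73.8500], ent_lat, ent_lon)
print(dists)
```

In the notebook this might be called with `train['latitude']`, `train['longitude']`, and the entrance columns of `results_df` converted with `.astype(float)`, then assigned to a new column. Note that the result is in degrees of latitude plus degrees of longitude, which is only a rough proxy for walking distance.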