<a href="https://colab.research.google.com/github/ekselan/DS-Unit-2-Linear-Models/blob/master/module2-regression-2/2_1_2_A_Copy_of_LS_DS_212_assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science

*Unit 2, Sprint 1, Module 2*

---

# Regression 2

## Assignment

You'll continue to **predict how much it costs to rent an apartment in NYC,** using the dataset from renthop.com.

- [ ] Do train/test split. Use data from April & May 2016 to train. Use data from June 2016 to test.
- [ ] Engineer at least two new features. (See below for explanation & ideas.)
- [ ] Fit a linear regression model with at least two features.
- [ ] Get the model's coefficients and intercept.
- [ ] Get regression metrics RMSE, MAE, and $R^2$, for both the train and test data.
- [ ] What's the best test MAE you can get? Share your score and features used with your cohort on Slack!
- [ ] As always, commit your notebook to your fork of the GitHub repo.


#### [Feature Engineering](https://en.wikipedia.org/wiki/Feature_engineering)

> "Some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used." — Pedro Domingos, ["A Few Useful Things to Know about Machine Learning"](https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf)

> "Coming up with features is difficult, time-consuming, requires expert knowledge. 'Applied machine learning' is basically feature engineering." — Andrew Ng, [Machine Learning and AI via Brain simulations](https://forum.stanford.edu/events/2011/2011slides/plenary/2011plenaryNg.pdf) 

> Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. 

#### Feature Ideas
- Does the apartment have a description?
- How long is the description?
- How many total perks does each apartment have?
- Are cats _or_ dogs allowed?
- Are cats _and_ dogs allowed?
- Total number of rooms (beds + baths)
- Ratio of beds to baths
- What's the neighborhood, based on address or latitude & longitude?

## Stretch Goals
- [ ] If you want more math, skim [_An Introduction to Statistical Learning_](http://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf),  Chapter 3.1, Simple Linear Regression, & Chapter 3.2, Multiple Linear Regression
- [ ] If you want more introduction, watch [Brandon Foltz, Statistics 101: Simple Linear Regression](https://www.youtube.com/watch?v=ZkjP5RJLQF4)
(20 minutes, over 1 million views)
- [ ] Add your own stretch goal(s) !

In [0]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Applied-Modeling/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'
    
# Ignore this Numpy warning when using Plotly Express:
# FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning, module='numpy')

In [0]:
import numpy as np
import pandas as pd

# Read New York City apartment rental listing data
df = pd.read_csv(DATA_PATH+'apartments/renthop-nyc.csv')
assert df.shape == (49352, 34)

# Remove the most extreme 1% prices,
# the most extreme .1% latitudes, &
# the most extreme .1% longitudes
df = df[(df['price'] >= np.percentile(df['price'], 0.5)) & 
        (df['price'] <= np.percentile(df['price'], 99.5)) & 
        (df['latitude'] >= np.percentile(df['latitude'], 0.05)) & 
        (df['latitude'] < np.percentile(df['latitude'], 99.95)) &
        (df['longitude'] >= np.percentile(df['longitude'], 0.05)) & 
        (df['longitude'] <= np.percentile(df['longitude'], 99.95))]

###Do train/test split. Use data from April & May 2016 to train. Use data from June 2016 to test.

In [0]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

In [86]:
print(df.shape)
df.head(3)

(48817, 34)


Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space
0,1.5,3,2016-06-24 07:54:24,A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ...,Metropolitan Avenue,40.7145,-73.9425,3000,792 Metropolitan Avenue,medium,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,1.0,2,2016-06-12 12:19:27,,Columbus Avenue,40.7947,-73.9667,5465,808 Columbus Avenue,low,1,1,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,1.0,1,2016-04-17 03:26:41,"Top Top West Village location, beautiful Pre-w...",W 13 Street,40.7388,-74.0018,2850,241 W 13 Street,high,0,0,1,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [87]:
# April and May into train data
train = df[df['created'].str.contains('-04-') | 
           df['created'].str.contains('-05-')]
print(train.shape)
train.head(3)

(31844, 34)


Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space
2,1.0,1,2016-04-17 03:26:41,"Top Top West Village location, beautiful Pre-w...",W 13 Street,40.7388,-74.0018,2850,241 W 13 Street,high,0,0,1,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,1.0,1,2016-04-18 02:22:02,Building Amenities - Garage - Garden - fitness...,East 49th Street,40.7539,-73.9677,3275,333 East 49th Street,low,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,1.0,4,2016-04-28 01:32:41,Beautifully renovated 3 bedroom flex 4 bedroom...,West 143rd Street,40.8241,-73.9493,3350,500 West 143rd Street,low,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [88]:
train['created'].str.contains('-06-').sum()
# Looks like June has been succesfully removed

0

In [89]:
a = df['created'].str.contains('-04-').sum()
b = df['created'].str.contains('-05-').sum()

a+b
# Looks like I still have appropriate number of rows, 
# so train data should be g2g

31844

In [90]:
# June into test data
test = df[df['created'].str.contains('-06-')] 
print(test.shape)
test.head(3)

(16973, 34)


Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space
0,1.5,3,2016-06-24 07:54:24,A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ...,Metropolitan Avenue,40.7145,-73.9425,3000,792 Metropolitan Avenue,medium,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,1.0,2,2016-06-12 12:19:27,,Columbus Avenue,40.7947,-73.9667,5465,808 Columbus Avenue,low,1,1,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
11,1.0,1,2016-06-03 03:21:22,Check out this one bedroom apartment in a grea...,W. 173rd Street,40.8448,-73.9396,1675,644 W. 173rd Street,low,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [91]:
df['created'].str.contains('-06-').sum()
# Looks like test data (June) has appropriate number of rows as well

16973

###Engineer at least two new features. 

In [92]:
# I'll make a column for total number of pets allowed (cats/dogs)
train['total_pets'] = train['cats_allowed'] + train['dogs_allowed']
train.head(1)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space,total_pets
2,1.0,1,2016-04-17 03:26:41,"Top Top West Village location, beautiful Pre-w...",W 13 Street,40.7388,-74.0018,2850,241 W 13 Street,high,0,0,1,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [93]:
# I'll make a column that is the total number of amenities

train['amenities'] = train['elevator'] + train['doorman'] + train['dishwasher'] 
+ train['laundry_in_building'] + train['fitness_center'] + train['laundry_in_unit']
+ train['roof_deck'] + train['outdoor_space'] + train['high_speed_internet']
+ train['balcony'] + train['swimming_pool'] + train['terrace'] + train['garden_patio']
+ train['wheelchair_access'] + train['common_outdoor_space'] 

train.head(1)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space,total_pets,amenities
2,1.0,1,2016-04-17 03:26:41,"Top Top West Village location, beautiful Pre-w...",W 13 Street,40.7388,-74.0018,2850,241 W 13 Street,high,0,0,1,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1


In [94]:
# I'll make a column for total number of rooms

train['total_rooms'] = train['bathrooms'] + train['bedrooms']
train.head(1)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space,total_pets,amenities,total_rooms
2,1.0,1,2016-04-17 03:26:41,"Top Top West Village location, beautiful Pre-w...",W 13 Street,40.7388,-74.0018,2850,241 W 13 Street,high,0,0,1,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,2.0


In [97]:
# I also think a column for "special location" could be interesting, meaning
# pre-war, exclusive or loft - I would think having a few of these or all 3 
# could impact price

train['spec_loc'] = train['pre-war'] + train['exclusive'] + train['loft']
train.head(1)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space,total_pets,amenities,total_rooms,spec_loc
2,1.0,1,2016-04-17 03:26:41,"Top Top West Village location, beautiful Pre-w...",W 13 Street,40.7388,-74.0018,2850,241 W 13 Street,high,0,0,1,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,2.0,0


###Fit a linear regression model with at least two features.

In [102]:
# First, I'll establish a baseline which is simply the mean rental price

target = 'price'
y_train = train[target]
y_test = test[target]
guess = y_train.mean()
print('Mean baseline:', guess)

Mean baseline: 3575.604007034292


In [106]:
# And I'll establish my baseline error

y_pred = [guess] * len(y_train)
mae = mean_absolute_error(y_train, y_pred)
print(f'Baseline train error: ${mae:.0f}')

Baseline train error: $1202


In [107]:
y_pred = [guess] * len(y_test)
mae = mean_absolute_error(y_test, y_pred)
print(f'Baseline test error: ${mae:.0f}')

Baseline test error: $1198


In [137]:
# Now I'll fit a linear regression model

model = LinearRegression()

features = ['amenities','total_rooms','total_pets','spec_loc','bedrooms',
            'bathrooms']

X_train = train[features]
y_train = train[target]

model.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [138]:
y_pred = model.predict(X_train)
mae = mean_absolute_error(y_train, y_pred)
print(f'Train error: ${mae:.0f}')

# Looks like an improvement of about -$500 error

Train error: $774


In [0]:
# I'll need to add those features to the test data in order to compare

test['total_pets'] = test['cats_allowed'] + test['dogs_allowed']

test['total_rooms'] = test['bathrooms'] + test['bedrooms']

test['spec_loc'] = test['pre-war'] + test['exclusive'] + test['loft']

test['amenities'] = test['elevator'] + test['doorman'] + test['dishwasher'] 
+ test['laundry_in_building'] + test['fitness_center'] + test['laundry_in_unit']
+ test['roof_deck'] + test['outdoor_space'] + test['high_speed_internet']
+ test['balcony'] + test['swimming_pool'] + test['terrace'] + test['garden_patio']
+ test['wheelchair_access'] + test['common_outdoor_space'] 

In [139]:
X_test = test[features]

y_pred = model.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
print(f'Test error: ${mae:.0f}')

# Similar improvement as the train error

Test error: $782


###Get the model's coefficients and intercept.

In [142]:
print('Intercept', model.intercept_)
print(f'Amenities coefficient: ${model.coef_[0]:.0f}')
print(f'Total rooms coefficient: ${model.coef_[1]:.0f}')
print(f'Total pets coefficient: ${model.coef_[2]:.0f}')
print(f'Special location coefficient: ${model.coef_[3]:.0f}')
print(f'Bedrooms coefficient: ${model.coef_[4]:.0f}')
print(f'Bathrooms coefficient: ${model.coef_[5]:.0f}')

Intercept 192.89845792840106
Amenities coefficient: $288
Total rooms coefficient: $771
Total pets coefficient: $70
Special location coefficient: $-2
Bedrooms coefficient: $-351
Bathrooms coefficient: $1122


###Get regression metrics RMSE, MAE, and  𝑅2 , for both the train and test data.

In [145]:
# Regression metrics for TRAIN data

y_pred = model.predict(X_train)
mse = mean_squared_error(y_train, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_train, y_pred)
r2 = r2_score(y_train, y_pred)
print(f'Root Mean Squared Error: ${rmse:.0f}')
print(f'Mean Absolute Error: ${mae:.0f}')
print('R^2:', r2)

Root Mean Squared Error: $1185
Mean Absolute Error: $774
R^2: 0.5480124230224228


In [146]:
# Regression metrics for TEST data

y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f'Root Mean Squared Error: ${rmse:.0f}')
print(f'Mean Absolute Error: ${mae:.0f}')
print('R^2:', r2)

Root Mean Squared Error: $1174
Mean Absolute Error: $782
R^2: 0.5566894643923536


#Stretch - how low can we go?

In [0]:
# I'll try encoding the interest level to see if that has an impact

In [149]:
test.columns

Index(['bathrooms', 'bedrooms', 'created', 'description', 'display_address',
       'latitude', 'longitude', 'price', 'street_address', 'interest_level',
       'elevator', 'cats_allowed', 'hardwood_floors', 'dogs_allowed',
       'doorman', 'dishwasher', 'no_fee', 'laundry_in_building',
       'fitness_center', 'pre-war', 'laundry_in_unit', 'roof_deck',
       'outdoor_space', 'dining_room', 'high_speed_internet', 'balcony',
       'swimming_pool', 'new_construction', 'terrace', 'exclusive', 'loft',
       'garden_patio', 'wheelchair_access', 'common_outdoor_space',
       'total_pets', 'total_rooms', 'spec_loc', 'amenities'],
      dtype='object')

In [153]:
test['interest_level'].value_counts()

low       11893
medium     3800
high       1280
Name: interest_level, dtype: int64

In [0]:
train['interest_level'] = train['interest_level'].str.replace('low','1')
train['interest_level'] = train['interest_level'].str.replace('medium','2')
train['interest_level'] = train['interest_level'].str.replace('high','3')
train['interest_level'] = train['interest_level'].astype(int)

In [0]:
test['interest_level'] = test['interest_level'].str.replace('low','1')
test['interest_level'] = test['interest_level'].str.replace('medium','2')
test['interest_level'] = test['interest_level'].str.replace('high','3')
test['interest_level'] = test['interest_level'].astype(int)
# test.head()

In [156]:
# Now I can view interest level in the correlation matrix
test.corr()

Unnamed: 0,bathrooms,bedrooms,latitude,longitude,price,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space,total_pets,total_rooms,spec_loc,amenities
bathrooms,1.0,0.520583,0.014415,-0.020741,0.693402,-0.069941,0.140601,0.020345,0.0997,0.023616,0.162787,0.169123,0.129353,-0.010312,0.149215,-0.019194,0.204784,0.098999,0.161634,0.239727,0.085464,0.152496,0.113849,0.070013,0.155035,-0.004665,0.027272,0.107076,0.065974,0.009194,0.022318,0.741118,-0.006653,0.199161
bedrooms,0.520583,1.0,0.001349,0.054167,0.53364,0.047804,-0.014186,-0.016543,0.097253,-0.01524,-0.048462,0.143979,0.164118,0.025917,0.008396,0.009236,0.148428,0.048312,0.137904,0.187361,0.065669,0.096399,0.0318,0.006303,0.098263,-0.013868,-0.102406,0.066459,0.013513,0.022232,-0.016139,0.959041,-0.041665,0.033946
latitude,0.014415,0.001349,1.0,0.329271,-0.031078,-0.05732,0.000235,-0.018981,0.017155,-0.017736,-0.044981,-0.014028,-0.042909,-0.013163,-0.109319,0.024126,-0.038179,-0.06502,-0.070392,0.018383,-0.025771,0.011307,0.024684,-0.059262,0.00951,-0.070991,-0.013922,0.002308,-0.059856,-0.102085,-0.018644,0.005843,-0.017474,-0.024669
longitude,-0.020741,0.054167,0.329271,1.0,-0.252779,0.069824,-0.188735,-0.046454,-0.112399,-0.056391,-0.278114,-0.18246,-0.088717,-0.014,-0.252391,-0.004068,-0.142195,-0.166744,-0.080724,-0.038748,-0.1201,-0.034532,-0.081154,-0.109685,-0.041809,0.042698,-0.063835,-0.030993,-0.062536,-0.076411,-0.052209,0.035712,-0.011199,-0.273695
price,0.693402,0.53364,-0.031078,-0.252779,1.0,-0.207575,0.212268,0.050087,0.094065,0.059416,0.283244,0.216728,0.126781,-0.017308,0.233688,-0.027935,0.25638,0.121825,0.153403,0.249157,0.08671,0.154628,0.138796,0.071753,0.152711,-0.017591,0.020823,0.124787,0.071626,0.023401,0.05559,0.649665,-0.022478,0.30024
interest_level,-0.069941,0.047804,-0.05732,0.069824,-0.207575,1.0,-0.003275,-0.060314,0.112862,-0.065071,-0.067269,0.066996,0.152541,0.097222,-0.041414,-0.034062,0.022886,0.01232,0.058982,0.035859,0.035929,0.025006,0.001477,0.01641,0.019863,0.024766,0.01665,0.019278,0.018075,0.024912,-0.063659,0.014385,-0.010394,-0.001567
elevator,0.140601,-0.014186,0.000235,-0.188735,0.212268,-0.003275,1.0,0.022656,0.276414,0.02729,0.609192,0.363356,0.221846,0.133981,0.429911,-0.098665,0.144183,0.334043,0.186838,0.20586,0.28047,0.159064,0.18325,0.177571,0.121598,0.028915,0.058906,0.076754,0.148934,0.096983,0.025355,0.035493,-0.044785,0.83427
cats_allowed,0.020345,-0.016543,-0.018981,-0.046454,0.050087,-0.060314,0.022656,1.0,-0.201139,0.939425,0.070537,-0.06399,-0.027768,0.103915,0.11017,0.050569,-0.026734,0.013641,0.038143,-0.025231,0.066808,0.017354,0.00622,0.017957,-0.003535,0.028961,-0.039915,0.00039,0.031567,0.088855,0.984802,-0.006259,0.038462,0.012458
hardwood_floors,0.0997,0.097253,0.017155,-0.112399,0.094065,0.112862,0.276414,-0.201139,1.0,-0.207966,0.230067,0.635776,0.334754,-0.140287,0.181312,0.017247,0.374937,0.291019,0.198695,0.31917,0.23191,0.182953,0.17689,0.191321,0.184415,-0.197301,0.117312,0.159501,0.126751,-0.111587,-0.207715,0.10955,-0.024464,0.48087
dogs_allowed,0.023616,-0.01524,-0.017736,-0.056391,0.059416,-0.065071,0.02729,0.939425,-0.207966,1.0,0.079459,-0.057223,-0.012614,0.0975,0.115696,0.050945,-0.014681,0.020517,0.041828,-0.014268,0.074857,0.024899,0.008304,0.030409,-0.002588,0.032853,-0.039507,0.002073,0.038153,0.088793,0.984677,-0.004149,0.040692,0.021016


In [162]:
# Then re-fit the model

features = ['amenities','total_rooms','total_pets','spec_loc','bedrooms',
            'bathrooms','interest_level']

X_train = train[features]
y_train = train[target]

model.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [163]:
X_test = test[features]

y_pred = model.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
print(f'Test error: ${mae:.0f}')

# Test error went from $782 to $742, so a slight improvement ...

Test error: $742


In [0]:
# Let's try adding longitude ...

In [164]:
features = ['amenities','total_rooms','total_pets','spec_loc','bedrooms',
            'bathrooms','interest_level','longitude']

X_train = train[features]
y_train = train[target]

model.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [165]:
X_test = test[features]

y_pred = model.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
print(f'Test error: ${mae:.0f}')

# Down to $689

Test error: $689


In [0]:
# Dining-room looked like it had some correlation, and I didn't include that as 
# an 'amenity'

In [166]:
features = ['amenities','total_rooms','total_pets','spec_loc','bedrooms',
            'bathrooms','interest_level','longitude','dining_room']

X_train = train[features]
y_train = train[target]

model.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [167]:
X_test = test[features]

y_pred = model.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
print(f'Test error: ${mae:.0f}')

# No difference, so I'll be satisfied with $689 for now

Test error: $689
