# Assignment
1. Start a clean notebook. Follow the process taught today.
2. Do a train/test split. Use data from April and May 2016 to train, and data from June 2016 to test.
3. Begin with a baseline for regression.
4. Select two or more features.
5. Do one-hot encoding. (Remember that it may not work well with high-cardinality categoricals.)
6. Use scikit-learn to fit a linear regression model on the train data.
7. Apply the model to predict rent prices for the test data.
8. Get the mean absolute error for the test data.
9. Get the model's coefficients and intercept.
10. Commit your notebook to your fork of the GitHub repo.

_What's the best test MAE you can get? Share your score and features used with your cohort on Slack!_

### Stretch Goals
- Try at least 3 different feature combinations.
- Get the regression metrics RMSE, MAE, and $R^2$ for both the train and test data.
- [Engineer new features!](https://en.wikipedia.org/wiki/Feature_engineering)

> "Some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used." — Pedro Domingos, ["A Few Useful Things to Know about Machine Learning"](https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf)

> "Coming up with features is difficult, time-consuming, requires expert knowledge. 'Applied machine learning' is basically feature engineering." — Andrew Ng, [Machine Learning and AI via Brain simulations](https://forum.stanford.edu/events/2011/2011slides/plenary/2011plenaryNg.pdf) 

> Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. 
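
As a small illustration of that definition, the sketch below derives three new columns from existing ones. The toy rows are made up, and the engineered column names (`total_rooms`, `pets_allowed`, `description_length`) are hypothetical choices for this example, not features the assignment prescribes.

```python
import pandas as pd

# Made-up rows that borrow a few column names from the rent data.
toy = pd.DataFrame({
    'bedrooms': [0, 2, 3],
    'bathrooms': [1.0, 1.0, 2.0],
    'cats_allowed': [1, 0, 1],
    'dogs_allowed': [1, 0, 0],
    'description': ['Sunny studio', None, 'Large 3BR near park'],
})

# Combine two numeric columns into one size-related feature.
toy['total_rooms'] = toy['bedrooms'] + toy['bathrooms']

# Collapse two binary flags into a single "any pets" flag.
toy['pets_allowed'] = ((toy['cats_allowed'] == 1) | (toy['dogs_allowed'] == 1)).astype(int)

# Turn free text into a number; a missing description counts as length 0.
toy['description_length'] = toy['description'].str.len().fillna(0)

print(toy[['total_rooms', 'pets_allowed', 'description_length']])
```

Whether any of these helps is an empirical question: add one to the feature list, refit, and compare test MAE.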

- Try different [scikit-learn scalers](https://scikit-learn.org/stable/modules/preprocessing.html)
- Try [scikit-learn pipelines](https://scikit-learn.org/stable/modules/compose.html):

> Pipeline can be used to chain multiple estimators into one. This is useful as there is often a fixed sequence of steps in processing the data, for example feature selection, normalization and classification. Pipeline serves multiple purposes here:

> - **Convenience and encapsulation.** You only have to call fit and predict once on your data to fit a whole sequence of estimators.
> - **Joint parameter selection.** You can grid search over parameters of all estimators in the pipeline at once.
> - **Safety.** Pipelines help avoid leaking statistics from your test data into the trained model in cross-validation, by ensuring that the same samples are used to train the transformers and predictors.
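
A minimal sketch of the idea quoted above, assuming synthetic data in place of the rent dataset. It uses scikit-learn's own `OneHotEncoder` inside a `ColumnTransformer` rather than `category_encoders`, so it is a stand-in for the encoder used later in this notebook, not a copy of it. All `toy_*` names and the generated prices are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the rent data (all values are made up).
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    'bedrooms': rng.integers(0, 4, size=n),
    'bathrooms': rng.integers(1, 3, size=n),
    'interest_level': rng.choice(['low', 'medium', 'high'], size=n),
})
toy['price'] = (1500 + 800 * toy['bedrooms'] + 500 * toy['bathrooms']
                + rng.normal(0, 100, size=n))

toy_train, toy_test = toy.iloc[:150], toy.iloc[150:]
toy_features = ['bedrooms', 'bathrooms', 'interest_level']
toy_target = 'price'

# One-hot encode the categorical column and scale the numeric ones.
preprocess = make_column_transformer(
    (OneHotEncoder(handle_unknown='ignore'), ['interest_level']),
    (StandardScaler(), ['bedrooms', 'bathrooms']),
)
pipeline = make_pipeline(preprocess, LinearRegression())

# fit() learns the encoder, scaler, and model from the training rows only;
# predict() reuses those fitted steps on the test rows.
pipeline.fit(toy_train[toy_features], toy_train[toy_target])
toy_pred = pipeline.predict(toy_test[toy_features])
pipeline_mae = mean_absolute_error(toy_test[toy_target], toy_pred)
print('Toy test MAE:', pipeline_mae)
```

Because the transformers live inside the pipeline, there is no separate `fit_transform` / `transform` bookkeeping to get wrong when the feature list changes.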

In [1]:
!pip install category_encoders



In [0]:
import numpy as np
import pandas as pd
from math import sqrt
import category_encoders as ce
import matplotlib.pyplot as plt

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error

In [0]:
LOCAL = '../data/nyc/nyc-rent-2016.csv'
WEB = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Linear-Models/master/data/nyc/nyc-rent-2016.csv'

df = pd.read_csv(WEB)
assert df.shape == (48300, 34)

In [35]:
df.sample(5)

Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,exclusive,terrace,loft,garden_patio,common_outdoor_space,wheelchair_access,month
13576,1.0,2,2016-06-22 06:47:25,Renovated true Two Bedroom in Luxury BuildingA...,East 33rd Street,40.745,-73.9779,3750,205 East 33rd Street,low,0,0,1,0,1,1,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,6
30091,1.0,3,2016-05-17 13:56:46,**NO FEE**W 50TH/9TH AVETrue 3 bedroom unit--a...,W 50th St.\r,40.7643,-73.9905,4295,439 W 50th St.\r,high,0,1,1,1,0,1,1,0,0,0,1,0,1,0,0,1,0,0,0,1,0,1,0,0,5
27811,1.0,1,2016-05-19 14:19:32,"Spacious, 1 bedroom 1 bathroom in FiDi with a ...",Financial District,40.7075,-74.0113,3200,Financial District,low,1,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,5
30673,1.0,0,2016-05-26 01:37:15,Chelsea studio available for immediate move-in...,West 25th Street,40.7478,-74.0008,2675,400 West 25th Street,low,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,5
39488,1.0,2,2016-05-20 01:34:34,Brand new renovation featuring renovated kitch...,West 101st Street,40.7978,-73.969,3495,215 West 101st Street,low,0,1,0,1,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,5


In [148]:
# parse the timestamp strings (infer_datetime_format is deprecated in pandas 2.x)
df['created'] = pd.to_datetime(df['created'])
# extract the month
df['month'] = df['created'].dt.month

# use the month to select train and test sets
train = df.query('month < 6')
test = df.query('month == 6')
train.shape, test.shape, df.shape

((31515, 35), (16785, 35), (48300, 35))
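
The assignment asks for a regression baseline before any model is fit. One common choice, sketched here on made-up numbers, is to predict the training mean for every test row and score that guess with MAE. The `*_toy` arrays are hypothetical stand-ins for `train['price']` and `test['price']`.

```python
import numpy as np

# Made-up prices standing in for train['price'] and test['price'].
y_train_toy = np.array([2000, 2500, 3000, 3500, 4000], dtype=float)
y_test_toy = np.array([2200, 3100, 5000], dtype=float)

# The baseline "model" always predicts the mean price seen in training.
baseline_guess = y_train_toy.mean()
y_pred_baseline = np.full_like(y_test_toy, baseline_guess)

# Mean absolute error of that constant guess on the held-out prices.
baseline_mae = np.mean(np.abs(y_test_toy - y_pred_baseline))
print(f'Baseline guess: {baseline_guess:.0f}')
print(f'Baseline test MAE: {baseline_mae:.2f}')
```

On the real data, the same idea would use `train['price'].mean()` as the guess and `mean_absolute_error` on `test['price']`. Any fitted model should beat that number to be worth keeping.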

In [0]:
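# Engineered feature: description length in characters (missing descriptions count as 0).
# Run this before the train/test split so that `train` and `test` include the column.
df['length'] = df['description'].str.len().fillna(0)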

In [146]:
df.select_dtypes(include=np.number).columns

Index(['bathrooms', 'bedrooms', 'latitude', 'longitude', 'price', 'elevator',
       'cats_allowed', 'hardwood_floors', 'dogs_allowed', 'doorman',
       'dishwasher', 'no_fee', 'laundry_in_building', 'fitness_center',
       'pre-war', 'laundry_in_unit', 'roof_deck', 'outdoor_space',
       'dining_room', 'high_speed_internet', 'balcony', 'swimming_pool',
       'new_construction', 'exclusive', 'terrace', 'loft', 'garden_patio',
       'common_outdoor_space', 'wheelchair_access', 'month', 'length'],
      dtype='object')

In [131]:
train.describe()

Unnamed: 0,bathrooms,bedrooms,latitude,longitude,price,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,exclusive,terrace,loft,garden_patio,common_outdoor_space,wheelchair_access,month,length
count,31515.0,31515.0,31515.0,31515.0,31515.0,31515.0,31515.0,31515.0,31515.0,31515.0,31515.0,31515.0,31515.0,31515.0,31515.0,31515.0,31515.0,31515.0,31515.0,31515.0,31515.0,31515.0,31515.0,31515.0,31515.0,31515.0,31515.0,31515.0,31515.0,31515.0,31515.0
mean,1.180406,1.500619,40.750767,-73.972464,3432.753419,0.526162,0.476122,0.476884,0.444645,0.423322,0.412819,0.365984,0.057687,0.262446,0.186419,0.168523,0.131556,0.141425,0.098017,0.086943,0.058861,0.053594,0.05004,0.042519,0.045439,0.042932,0.038077,0.029224,0.027162,4.490274,602.423671
std,0.42897,1.092892,0.039319,0.029628,1398.444608,0.499323,0.499437,0.499473,0.496934,0.494093,0.492349,0.481713,0.233154,0.439971,0.389451,0.374336,0.338013,0.348465,0.297342,0.281756,0.235368,0.225217,0.21803,0.201774,0.208268,0.202707,0.191385,0.168437,0.162557,0.499913,390.477039
min,0.0,0.0,40.5757,-74.094,1025.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0
25%,1.0,1.0,40.7283,-73.9917,2490.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,343.0
50%,1.0,1.0,40.7516,-73.9781,3100.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,565.0
75%,1.0,2.0,40.7735,-73.9547,4000.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,809.0
max,10.0,7.0,40.9154,-73.7001,9999.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,5.0,4466.0


In [73]:
train.describe(exclude='number')

| | created | description | display_address | street_address | interest_level |
|---|---|---|---|---|---|
| count | 31515 | 30549 | 31447 | 31509 | 31515 |
| unique | 31116 | 25482 | 6492 | 11247 | 3 |
| top | 2016-05-02 03:41:36 | | Broadway | 505 West 37th Street | low |
| freq | 3 | 897 | 268 | 120 | 21613 |
| first | 2016-04-01 22:12:41 | | | | |
| last | 2016-05-31 23:10:48 | | | | |
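
The `unique` row above is where the high-cardinality warning from the assignment becomes concrete: `interest_level` has 3 levels, while the address columns have thousands, and one-hot encoding those would add thousands of columns. A rough sketch of screening columns by cardinality before encoding follows. The toy frame and the `MAX_LEVELS` cutoff are assumptions made for illustration.

```python
import pandas as pd

# Made-up rows: one low-cardinality and one high-cardinality text column.
toy = pd.DataFrame({
    'interest_level': ['low', 'high', 'medium', 'low', 'low', 'medium'],
    'street_address': ['1 A St', '2 B Ave', '3 C Rd', '4 D Ln', '5 E Way', '6 F Ct'],
    'bedrooms': [1, 2, 0, 3, 1, 2],
})

MAX_LEVELS = 5  # hypothetical cutoff; tune it for the data at hand

# Count distinct values in each non-numeric column.
cardinality = toy.select_dtypes(exclude='number').nunique()

# Keep only the columns that are cheap to one-hot encode.
low_cardinality = cardinality[cardinality <= MAX_LEVELS].index.tolist()
print(cardinality)
print('Safe to one-hot encode:', low_cardinality)
```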


In [0]:
# dogs_allowed > cats_allowed
# no_fee, garden_patio, common_outdoor_space, balcony, swimming_pool, month, length: worse

In [155]:
# OneHotEncoder
# Feature set for this run; 'cats_allowed', 'loft', and 'latitude' are left out.
features = ['hardwood_floors', 'laundry_in_unit', 'exclusive', 'dogs_allowed',
            'wheelchair_access', 'terrace', 'doorman', 'bathrooms',
            'interest_level', 'swimming_pool', 'fitness_center',
            'laundry_in_building', 'high_speed_internet', 'dining_room',
            'outdoor_space', 'elevator', 'pre-war', 'roof_deck',
            'new_construction', 'longitude', 'dishwasher', 'bedrooms']
target = 'price'

X_train = train[features]
y_train = train[target]
X_test = test[features]
y_test = test[target]

# One-hot encode the only categorical feature, interest_level.
# Fit the encoder on train, then apply the same mapping to test.
encoder = ce.OneHotEncoder(use_cat_names=True)
X_train = encoder.fit_transform(X_train)
X_test = encoder.transform(X_test)

# Fit on train, predict on test
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Test-set metrics
print('Test MAE: ', mean_absolute_error(y_test, y_pred))
print('Test RMSE: ', sqrt(mean_squared_error(y_test, y_pred)))
print('Test R^2: ', r2_score(y_test, y_pred))

# Fitted parameters
coe, inter = model.coef_, model.intercept_
print('Intercept: ', inter)
print('Coefficients: \n', coe)

Test MAE:  590.2408698242479
Test RMSE:  843.4513236805835
Test R^2:  0.6406059322000184
Intercept:  -9878624686394476.0
Coefficients: 
 [-1.46884133e+02  3.48829432e+02  4.53468583e+01  5.65495174e+01
  1.17045083e+02  9.36139327e+01  3.80593945e+02  1.20604775e+03
  9.87862469e+15  9.87862469e+15  9.87862469e+15  2.98555799e+01
  7.40333299e+01 -1.75387750e+02 -2.61458738e+02  8.99429870e+01
 -8.11172755e+01  1.86891495e+02 -8.94740585e+01 -1.28705897e+02
 -1.67681809e+02 -1.23402576e+04  9.58382981e+01  5.09515155e+02]
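
The intercept near -9.88e15 and the three identical coefficients near 9.88e15 sit exactly where the three one-hot `interest_level` columns fall in the feature order. A likely explanation is that those three columns always sum to 1, which duplicates the intercept term. The fit can then trade an arbitrarily large constant between them without changing the predictions, so the individual values are not interpretable. The sketch below shows the redundancy on synthetic data and one common remedy, dropping one level. It uses `pandas.get_dummies` as a stand-in for `ce.OneHotEncoder`, and every name and number in it is made up.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data: price depends on bedrooms and on a 3-level category.
rng = np.random.default_rng(42)
n = 300
levels = rng.choice(['low', 'medium', 'high'], size=n)
bedrooms = rng.integers(0, 4, size=n)
level_effect = pd.Series(levels).map({'low': 0, 'medium': -100, 'high': -200}).to_numpy()
price = 2000 + 700 * bedrooms + level_effect + rng.normal(0, 50, size=n)

toy = pd.DataFrame({'bedrooms': bedrooms, 'interest_level': levels})

# Full one-hot encoding keeps all three levels; the reduced one drops the first.
full = pd.get_dummies(toy, columns=['interest_level'], dtype=float)
reduced = pd.get_dummies(toy, columns=['interest_level'], drop_first=True, dtype=float)

def design_rank(X):
    """Rank of the design matrix once a column of ones (the intercept) is added."""
    with_const = np.column_stack([np.ones(len(X)), X.to_numpy(dtype=float)])
    return np.linalg.matrix_rank(with_const), with_const.shape[1]

rank_full, cols_full = design_rank(full)           # rank is one less than the column count
rank_reduced, cols_reduced = design_rank(reduced)  # rank equals the column count
print('full:   ', rank_full, 'of', cols_full)
print('reduced:', rank_reduced, 'of', cols_reduced)

# With the redundancy removed, the coefficients are readable again:
# each remaining level is measured relative to the dropped one.
toy_model = LinearRegression().fit(reduced, price)
print(dict(zip(reduced.columns, toy_model.coef_.round(1))), round(toy_model.intercept_, 1))
```

Test MAE is typically unaffected either way; the change matters when the coefficients themselves are being read.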


In [0]:
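%matplotlib inline

# A scatter plot needs a single x variable, so compare actual and predicted
# prices instead of passing the multi-column feature matrix.
plt.scatter(y_test, y_pred, alpha=0.05)

# Reference line where predicted == actual
lims = [y_test.min(), y_test.max()]
plt.plot(lims, lims, color='red')

plt.xlabel('Actual price')
plt.ylabel('Predicted price')
plt.title('Linear Regression');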

In [141]:
# OneHotEncoder & StandardScaler

features = ['interest_level','bathrooms','bedrooms','elevator','cats_allowed',
            'hardwood_floors','dogs_allowed','doorman','dishwasher',
            'fitness_center','laundry_in_unit','roof_deck','high_speed_internet',
            'new_construction','exclusive','terrace','loft','dining_room',
            'laundry_in_building','pre-war','latitude','longitude',
            'outdoor_space','wheelchair_access']
target = 'price'

X_train = train[features]
y_train = train[target]

X_test = test[features]
y_test = test[target]

encoder = ce.OneHotEncoder(use_cat_names=True)
X_train_encoded = encoder.fit_transform(X_train)
X_test_encoded = encoder.transform(X_test)

scaler = StandardScaler()
X_train_scaler = scaler.fit_transform(X_train_encoded)
X_test_scaler = scaler.transform(X_test_encoded)

model = LinearRegression()
model.fit(X_train_scaler, y_train)

y_pred = model.predict(X_test_scaler)
print('Test MAE: ', mean_absolute_error(y_test, y_pred))

Test MAE:  591.1121061546107
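
One of the stretch goals asks for RMSE, MAE, and $R^2$ on both the train and test data. A small helper keeps that from turning into six copy-pasted print statements. The function name `regression_metrics` and the toy numbers are assumptions for this sketch.

```python
from math import sqrt

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE, and R^2 for one set of predictions."""
    return {
        'MAE': mean_absolute_error(y_true, y_pred),
        'RMSE': sqrt(mean_squared_error(y_true, y_pred)),
        'R^2': r2_score(y_true, y_pred),
    }


# Made-up values, just to show the call.
y_true_toy = [100, 200, 300]
y_pred_toy = [110, 190, 330]
toy_metrics = regression_metrics(y_true_toy, y_pred_toy)
for name, value in toy_metrics.items():
    print(f'{name}: {value:.3f}')
```

In this notebook the helper would be called twice: once with `y_train` and the model's predictions on the scaled train features, and once with `y_test` and `y_pred`. A large gap between the two sets of numbers is a sign of overfitting.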
