Lambda School Data Science

*Unit 2, Sprint 1, Module 3*

---

# Ridge Regression

## Assignment

We're going back to our other **New York City** real estate dataset. Instead of predicting apartment rents, you'll predict property sales prices.

But not just for condos in Tribeca...

- [x] Use a subset of the data where `BUILDING_CLASS_CATEGORY` == `'01 ONE FAMILY DWELLINGS'` and the sale price was more than 100 thousand and less than 2 million.
- [x] Do train/test split. Use data from January through March 2019 to train. Use data from April 2019 to test.
- [x] Do one-hot encoding of categorical features.
- [ ] Do feature selection with `SelectKBest`.
- [ ] Fit a ridge regression model with multiple features. Use the `normalize=True` parameter (available only in scikit-learn versions before 1.2) or do [feature scaling](https://scikit-learn.org/stable/modules/preprocessing.html) beforehand: use the scaler's `fit_transform` method with the train set, and the scaler's `transform` method with the test set.
- [ ] Get mean absolute error for the test set.
- [ ] As always, commit your notebook to your fork of the GitHub repo.
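
The feature-scaling route in the checklist can be sketched as follows. This is a minimal illustration on synthetic data, not the NYC sales data: the array shapes, the coefficients, and `alpha=1.0` are all assumptions chosen for the example. Note that `StandardScaler` is not numerically identical to the old `normalize=True` behavior, so a good `alpha` may differ between the two approaches.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical; NOT the NYC sales data).
# Each column has a very different scale, which is when scaling matters most.
rng = np.random.default_rng(0)
scales = np.array([1.0, 10.0, 100.0, 1000.0, 10000.0])
X_train = rng.normal(size=(200, 5)) * scales
X_test = rng.normal(size=(50, 5)) * scales
coef = np.array([3.0, 0.5, 0.02, 0.001, 0.0001])
y_train = X_train @ coef + rng.normal(size=200)
y_test = X_test @ coef + rng.normal(size=50)

# Learn the scaling parameters from the train set only...
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
# ...and reuse them on the test set, so no test information leaks into training.
X_test_scaled = scaler.transform(X_test)

# alpha=1.0 is an arbitrary illustrative value, not a tuned one.
model = Ridge(alpha=1.0)
model.fit(X_train_scaled, y_train)

mae = mean_absolute_error(y_test, model.predict(X_test_scaled))
print(f'Test MAE: {mae:.2f}')
```

Calling `fit_transform` on the test set instead of `transform` would rescale the test data with its own mean and standard deviation, which makes the test score less trustworthy.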

The [NYC Department of Finance](https://www1.nyc.gov/site/finance/taxes/property-rolling-sales-data.page) has a glossary of property sales terms and NYC Building Class Code Descriptions. The data comes from the [NYC OpenData](https://data.cityofnewyork.us/browse?q=NYC%20calendar%20sales) portal.


## Stretch Goals

Don't worry, you aren't expected to do all these stretch goals! These are just ideas to consider and choose from.

- [ ] Add your own stretch goal(s)!
- [ ] Instead of `Ridge`, try `LinearRegression`. Depending on how many features you select, your errors will probably blow up! 💥
- [ ] Instead of `Ridge`, try [`RidgeCV`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeCV.html).
- [ ] Learn more about feature selection:
    - ["Permutation importance"](https://www.kaggle.com/dansbecker/permutation-importance)
    - [scikit-learn's User Guide for Feature Selection](https://scikit-learn.org/stable/modules/feature_selection.html)
    - [mlxtend](http://rasbt.github.io/mlxtend/) library
    - scikit-learn-contrib libraries: [boruta_py](https://github.com/scikit-learn-contrib/boruta_py) & [stability-selection](https://github.com/scikit-learn-contrib/stability-selection)
    - [_Feature Engineering and Selection_](http://www.feat.engineering/) by Kuhn & Johnson.
- [ ] Try [statsmodels](https://www.statsmodels.org/stable/index.html) if you’re interested in a more inferential statistical approach to linear regression and feature selection, looking at p-values and 95% confidence intervals for the coefficients.
- [ ] Read [_An Introduction to Statistical Learning_](http://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf), Chapters 1-3, for more math & theory, but in an accessible, readable way.
- [ ] Try [scikit-learn pipelines](https://scikit-learn.org/stable/modules/compose.html).
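
Two of the stretch goals above, `RidgeCV` and pipelines, combine naturally. The sketch below is one possible arrangement, shown on synthetic data from `make_regression` rather than the assignment's dataset. The value `k=10`, the `alphas` grid, and the simple 240/60 split are assumptions made for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import RidgeCV
from sklearn.metrics import mean_absolute_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (hypothetical stand-in for the real features)
X, y = make_regression(n_samples=300, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)
X_train, X_test = X[:240], X[240:]
y_train, y_test = y[:240], y[240:]

# Each step is fit on the train set only; the pipeline then applies
# the same fitted transformations to the test set at predict time.
pipeline = make_pipeline(
    StandardScaler(),
    SelectKBest(score_func=f_regression, k=10),
    RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0, 100.0]),
)
pipeline.fit(X_train, y_train)

# make_pipeline names each step after its lowercased class name
print('Chosen alpha:', pipeline.named_steps['ridgecv'].alpha_)
print('Test MAE:', mean_absolute_error(y_test, pipeline.predict(X_test)))
```

`RidgeCV` picks `alpha` by cross-validation on the training data, so the test set stays untouched until the final score.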

In [142]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Applied-Modeling/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'
    
# Ignore this Numpy warning when using Plotly Express:
# FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning, module='numpy')

In [143]:
import pandas as pd
import pandas_profiling

# Read New York City property sales data
df = pd.read_csv(DATA_PATH+'condos/NYC_Citywide_Rolling_Calendar_Sales.csv')

# Change column names: replace spaces with underscores
df.columns = [col.replace(' ', '_') for col in df]

# SALE_PRICE was read as strings.
# Remove symbols (matched as literal characters, not as regex), then convert to integer
df['SALE_PRICE'] = (
    df['SALE_PRICE']
    .str.replace('$', '', regex=False)
    .str.replace('-', '', regex=False)
    .str.replace(',', '', regex=False)
    .astype(int)
)

In [144]:
# BOROUGH is a numeric column, but arguably should be a categorical feature,
# so convert it from a number to a string
df['BOROUGH'] = df['BOROUGH'].astype(str)

In [145]:
# Reduce cardinality for NEIGHBORHOOD feature

# Get a list of the top 10 neighborhoods
top10 = df['NEIGHBORHOOD'].value_counts()[:10].index

# At locations where the neighborhood is NOT in the top 10, 
# replace the neighborhood with 'OTHER'
df.loc[~df['NEIGHBORHOOD'].isin(top10), 'NEIGHBORHOOD'] = 'OTHER'

In [293]:
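# Parse sale dates and extract the month number
df['SALE_DATE'] = pd.to_datetime(df['SALE_DATE'])
df['SALE_MONTH'] = df['SALE_DATE'].dt.month

# Keep one-family dwellings that sold for more than $100,000 and less than $2,000,000
df = df[(df.BUILDING_CLASS_CATEGORY == '01 ONE FAMILY DWELLINGS')
         &
         (df.SALE_PRICE > 100000)
         &
         (df.SALE_PRICE < 2000000)]

# Train on January through March, test on April
train = df[df['SALE_MONTH'] < 4]
test = df[df['SALE_MONTH'] == 4]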

target = 'SALE_PRICE'
high_cardinality = ['SALE_DATE','BUILDING_CLASS_AT_TIME_OF_SALE',
                    'BUILDING_CLASS_AT_PRESENT']
features = train.columns.drop([target]+high_cardinality+['EASE-MENT'])

X_train = train[features] 
y_train = train[target]

X_test = test[features]
y_test = test[target]

import category_encoders as ce
encoder = ce.OneHotEncoder(use_cat_names=True)
X_train = encoder.fit_transform(X_train)
X_test = encoder.transform(X_test)

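from sklearn.feature_selection import SelectKBest, f_regression
# The target is continuous, so score features with f_regression
# (SelectKBest defaults to f_classif, which is meant for classification)
selector = SelectKBest(score_func=f_regression, k=1304)
X_train_selected = selector.fit_transform(X_train, y_train)
X_test_selected = selector.transform(X_test)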

from sklearn.linear_model import Ridge
ridge = Ridge(alpha=.01,normalize=True)
ridge.fit(X_train_selected,y_train)
y_pred = ridge.predict(X_test_selected)

In [294]:
from sklearn.metrics import mean_absolute_error
mae = mean_absolute_error(y_test,y_pred)

In [295]:

mae

163504.25180487984
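
A mean absolute error is easier to interpret next to a naive baseline, such as always predicting the mean training price. The sketch below shows the idea with a handful of made-up prices: `y_train_demo` and `y_test_demo` are hypothetical values for illustration, not rows from the dataset.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical sale prices, for illustration only
y_train_demo = np.array([300_000, 450_000, 500_000, 750_000, 1_000_000])
y_test_demo = np.array([400_000, 600_000, 900_000])

# Baseline: predict the mean training price for every test row
baseline_pred = np.full(len(y_test_demo), y_train_demo.mean())
baseline_mae = mean_absolute_error(y_test_demo, baseline_pred)
print(f'Baseline MAE: ${baseline_mae:,.0f}')  # prints "Baseline MAE: $166,667"
```

Running the same comparison with the real `y_train` and `y_test` would show how much the ridge model improves on guessing the average.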

In [291]:
y_pred

array([ 640108.92174493,  345435.89862877,  822341.91535582,
        695941.37896566,  500886.55612654,  664410.80796998,
        625347.83847976,  677148.83976646,  704065.95081663,
        732921.78838459,  737298.11732492,  453684.02239927,
        439653.87535395,  788974.51584214,  478222.01684372,
        479081.55694354,  629108.50296508,  927927.59913966,
        591099.3207175 ,  704912.7911558 ,  961255.97200589,
        511262.70374606,  663118.48485582,  548329.36736796,
        665079.75004698,  564533.59485664,  443095.62111879,
        768122.45420153,  647792.05822698,  575839.88001275,
        345384.27878799,  586605.31597819,  590710.24600041,
        429392.69672293,  848866.93984558,  682547.36433318,
        640785.80568358,  573015.44092965,  504216.49740862,
        539101.94819779,  426833.44816176,  494522.0583722 ,
       1075031.10460595,  695829.81499907,  717843.60166446,
        701975.09414256,  843792.49141315,  675329.54060078,
        623318.16926579,

In [292]:
y_test

18235     895000
18239     253500
18244    1300000
18280     789000
18285     525000
          ...   
23029     635000
23031     514000
23032     635000
23033     545000
23035     510000
Name: SALE_PRICE, Length: 644, dtype: int64