Lambda School Data Science

*Unit 2, Sprint 1, Module 3*

---

# Ridge Regression

## Assignment

We're going back to our other **New York City** real estate dataset. Instead of predicting apartment rents, you'll predict property sales prices.

But not just for condos in Tribeca...

- [ ] Use a subset of the data where `BUILDING_CLASS_CATEGORY` == `'01 ONE FAMILY DWELLINGS'` and the sale price was more than 100 thousand and less than 2 million.
- [ ] Do train/test split. Use data from January — March 2019 to train. Use data from April 2019 to test.
- [ ] Do one-hot encoding of categorical features.
- [ ] Do feature selection with `SelectKBest`.
- [ ] Fit a ridge regression model with multiple features. Use the `normalize=True` parameter (or do [feature scaling](https://scikit-learn.org/stable/modules/preprocessing.html) beforehand — use the scaler's `fit_transform` method with the train set, and the scaler's `transform` method with the test set)
- [ ] Get mean absolute error for the test set.
- [ ] As always, commit your notebook to your fork of the GitHub repo.
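
The scaling step in the checklist (fit the scaler on the train set, then only transform the test set) can be sketched roughly as follows. This is a minimal illustration on hypothetical toy arrays, not the assignment data; `StandardScaler` and `alpha=1.0` are assumptions chosen for the example.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data standing in for the selected features and sale prices.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(80, 3)) * [1.0, 100.0, 10_000.0]
X_test = rng.normal(size=(20, 3)) * [1.0, 100.0, 10_000.0]
true_coef = np.array([5.0, 0.3, 0.002])
y_train = X_train @ true_coef + rng.normal(scale=0.1, size=80)
y_test = X_test @ true_coef + rng.normal(scale=0.1, size=20)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from train only
X_test_scaled = scaler.transform(X_test)        # reuse the train statistics

model = Ridge(alpha=1.0)  # alpha is an arbitrary choice for this sketch
model.fit(X_train_scaled, y_train)
mae = mean_absolute_error(y_test, model.predict(X_test_scaled))
print("Test MAE:", mae)
```

Fitting the scaler on the train set alone keeps information about the test set out of the preprocessing step.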

The [NYC Department of Finance](https://www1.nyc.gov/site/finance/taxes/property-rolling-sales-data.page) has a glossary of property sales terms and NYC Building Class Code Descriptions. The data comes from the [NYC OpenData](https://data.cityofnewyork.us/browse?q=NYC%20calendar%20sales) portal.


## Stretch Goals

Don't worry, you aren't expected to do all these stretch goals! These are just ideas to consider and choose from.

- [ ] Add your own stretch goal(s)!
- [ ] Instead of `Ridge`, try `LinearRegression`. Depending on how many features you select, your errors will probably blow up! 💥
- [ ] Instead of `Ridge`, try [`RidgeCV`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeCV.html).
- [ ] Learn more about feature selection:
    - ["Permutation importance"](https://www.kaggle.com/dansbecker/permutation-importance)
    - [scikit-learn's User Guide for Feature Selection](https://scikit-learn.org/stable/modules/feature_selection.html)
    - [mlxtend](http://rasbt.github.io/mlxtend/) library
    - scikit-learn-contrib libraries: [boruta_py](https://github.com/scikit-learn-contrib/boruta_py) & [stability-selection](https://github.com/scikit-learn-contrib/stability-selection)
    - [_Feature Engineering and Selection_](http://www.feat.engineering/) by Kuhn & Johnson.
- [ ] Try [statsmodels](https://www.statsmodels.org/stable/index.html) if you’re interested in a more inferential statistical approach to linear regression and feature selection, looking at p-values and 95% confidence intervals for the coefficients.
- [ ] Read [_An Introduction to Statistical Learning_](http://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf), Chapters 1-3, for more math & theory, but in an accessible, readable way.
- [ ] Try [scikit-learn pipelines](https://scikit-learn.org/stable/modules/compose.html).

In [186]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Applied-Modeling/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'
    
# Ignore this Numpy warning when using Plotly Express:
# FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning, module='numpy')

In [187]:
import pandas as pd
import pandas_profiling

# Read New York City property sales data
df = pd.read_csv(DATA_PATH+'condos/NYC_Citywide_Rolling_Calendar_Sales.csv')

# Change column names: replace spaces with underscores
df.columns = [col.replace(' ', '_') for col in df]

# SALE_PRICE was read as strings.
# Remove symbols (as literal characters, not regex patterns), convert to integer
df['SALE_PRICE'] = (
    df['SALE_PRICE']
    .str.replace('$', '', regex=False)
    .str.replace('-', '', regex=False)
    .str.replace(',', '', regex=False)
    .astype(int)
)

In [188]:
# BOROUGH is a numeric column, but arguably should be a categorical feature,
# so convert it from a number to a string
df['BOROUGH'] = df['BOROUGH'].astype(str)

In [189]:
# Reduce cardinality for NEIGHBORHOOD feature

# Get a list of the top 10 neighborhoods
top10 = df['NEIGHBORHOOD'].value_counts()[:10].index

# At locations where the neighborhood is NOT in the top 10, 
# replace the neighborhood with 'OTHER'
df.loc[~df['NEIGHBORHOOD'].isin(top10), 'NEIGHBORHOOD'] = 'OTHER'

In [190]:
# Read DataFrame head
df.head()

Unnamed: 0,BOROUGH,NEIGHBORHOOD,BUILDING_CLASS_CATEGORY,TAX_CLASS_AT_PRESENT,BLOCK,LOT,EASE-MENT,BUILDING_CLASS_AT_PRESENT,ADDRESS,APARTMENT_NUMBER,ZIP_CODE,RESIDENTIAL_UNITS,COMMERCIAL_UNITS,TOTAL_UNITS,LAND_SQUARE_FEET,GROSS_SQUARE_FEET,YEAR_BUILT,TAX_CLASS_AT_TIME_OF_SALE,BUILDING_CLASS_AT_TIME_OF_SALE,SALE_PRICE,SALE_DATE
0,1,OTHER,13 CONDOS - ELEVATOR APARTMENTS,2,716,1246,,R4,"447 WEST 18TH STREET, PH12A",PH12A,10011.0,1.0,0.0,1.0,10733,1979.0,2007.0,2,R4,0,01/01/2019
1,1,OTHER,21 OFFICE BUILDINGS,4,812,68,,O5,144 WEST 37TH STREET,,10018.0,0.0,6.0,6.0,2962,15435.0,1920.0,4,O5,0,01/01/2019
2,1,OTHER,21 OFFICE BUILDINGS,4,839,69,,O5,40 WEST 38TH STREET,,10018.0,0.0,7.0,7.0,2074,11332.0,1930.0,4,O5,0,01/01/2019
3,1,OTHER,13 CONDOS - ELEVATOR APARTMENTS,2,592,1041,,R4,"1 SHERIDAN SQUARE, 8C",8C,10014.0,1.0,0.0,1.0,0,500.0,0.0,2,R4,0,01/01/2019
4,1,UPPER EAST SIDE (59-79),15 CONDOS - 2-10 UNIT RESIDENTIAL,2C,1379,1402,,R1,"20 EAST 65TH STREET, B",B,10065.0,1.0,0.0,1.0,0,6406.0,0.0,2,R1,0,01/01/2019


In [191]:
# Read shape
df.shape

(23040, 21)

In [192]:
# Convert to datetime to use for index
df['SALE_DATE'] = pd.to_datetime(df['SALE_DATE'])
df = df.set_index('SALE_DATE')

In [193]:
# First condition
condition1 = df['BUILDING_CLASS_CATEGORY'] == '01 ONE FAMILY DWELLINGS'
df = df[condition1]

In [194]:
# See shape
df.shape

(5061, 20)

In [195]:
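# Second condition
# Note: the assignment asks for prices above 100 thousand (100000);
# this run used 1000 as the lower bound, which the shape below reflects
condition2 = (df['SALE_PRICE'] > 1000) & (df['SALE_PRICE'] < 2000000)
df = df[condition2]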

In [196]:
# See the shape
df.shape

(3203, 20)
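
For comparison, the subset described in the assignment text (one-family dwellings priced above 100 thousand and below 2 million) might look like the sketch below. The toy rows are hypothetical; only the column names and the category label come from the dataset above.

```python
import pandas as pd

# Hypothetical toy rows; the column names match the dataset above.
toy = pd.DataFrame({
    'BUILDING_CLASS_CATEGORY': ['01 ONE FAMILY DWELLINGS'] * 4 + ['21 OFFICE BUILDINGS'],
    'SALE_PRICE': [50_000, 100_000, 450_000, 2_500_000, 900_000],
})

is_one_family = toy['BUILDING_CLASS_CATEGORY'] == '01 ONE FAMILY DWELLINGS'
# Strict inequalities: exactly 100,000 or 2,000,000 is excluded.
in_price_range = (toy['SALE_PRICE'] > 100_000) & (toy['SALE_PRICE'] < 2_000_000)

subset = toy[is_one_family & in_price_range]
print(subset)
```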

In [197]:
# Drop useless columns
df = df.drop(['EASE-MENT', 'APARTMENT_NUMBER'], axis=1)

In [198]:
X = df.drop('SALE_PRICE', axis=1)
y = df['SALE_PRICE']

In [199]:
# Train test split
from sklearn import model_selection

X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y)
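
The cell above splits the rows at random. The assignment instead asks for a split by date: January through March 2019 for training, April 2019 for testing. A minimal sketch of that split, assuming a `SALE_DATE` datetime index like the one set on `df` earlier and using a hypothetical toy frame:

```python
import pandas as pd

# Hypothetical toy frame indexed by sale date, like df above.
toy = pd.DataFrame(
    {'GROSS_SQUARE_FEET': [1200, 1500, 900, 2000, 1750],
     'SALE_PRICE': [400_000, 520_000, 310_000, 780_000, 650_000]},
    index=pd.to_datetime(['2019-01-15', '2019-02-03', '2019-03-31',
                          '2019-04-01', '2019-04-20']),
)
toy.index.name = 'SALE_DATE'

cutoff = pd.Timestamp('2019-04-01')
train = toy[toy.index < cutoff]    # January through March
test = toy[toy.index >= cutoff]    # April

X_train_toy, y_train_toy = train.drop(columns='SALE_PRICE'), train['SALE_PRICE']
X_test_toy, y_test_toy = test.drop(columns='SALE_PRICE'), test['SALE_PRICE']
print(len(train), len(test))
```

A date-based split mimics predicting future sales from past ones, which a random split does not.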

In [200]:
import numpy as np

# Reshape the NEIGHBORHOOD values to a 2-D array, as OneHotEncoder expects
neighborhoods = X_train.NEIGHBORHOOD.to_numpy()[:, np.newaxis]

In [201]:
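# One-hot encode NEIGHBORHOOD
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder(sparse=False)
onehot = enc.fit_transform(neighborhoods)
# See how the one-hot array looks (every 25th row)
onehot[::25]

# Note: assigning the 2-D one-hot array to a single column leaves only
# one number per row in that column, so most of the encoding is lost
X_train['NEIGHBORHOOD'] = onehot

# Note: fit_transform refits the encoder on the test set; the usual
# practice is to fit on train and call only transform on test
neighborhoods = X_test.NEIGHBORHOOD.to_numpy()[:, np.newaxis]
onehot = enc.fit_transform(neighborhoods)
X_test['NEIGHBORHOOD'] = onehot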

SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


### Continue doing One-Hot encoding for all categorical variables

In [202]:
# Encode BUILDING_CLASS_CATEGORY the same way (train, then test)
building_class = X_train.BUILDING_CLASS_CATEGORY.to_numpy()[:, np.newaxis]
onehot = enc.fit_transform(building_class)
X_train['BUILDING_CLASS_CATEGORY'] = onehot

building_class = X_test.BUILDING_CLASS_CATEGORY.to_numpy()[:, np.newaxis]
onehot = enc.fit_transform(building_class)
X_test['BUILDING_CLASS_CATEGORY'] = onehot


In [203]:
# Encode TAX_CLASS_AT_PRESENT
tax_class = X_train.TAX_CLASS_AT_PRESENT.to_numpy()[:, np.newaxis]
onehot = enc.fit_transform(tax_class)
X_train['TAX_CLASS_AT_PRESENT'] = onehot

tax_class = X_test.TAX_CLASS_AT_PRESENT.to_numpy()[:, np.newaxis]
onehot = enc.fit_transform(tax_class)
X_test['TAX_CLASS_AT_PRESENT'] = onehot


In [204]:
# Encode BUILDING_CLASS_AT_PRESENT
bu_class = X_train.BUILDING_CLASS_AT_PRESENT.to_numpy()[:, np.newaxis]
onehot = enc.fit_transform(bu_class)
X_train['BUILDING_CLASS_AT_PRESENT'] = onehot

bu_class = X_test.BUILDING_CLASS_AT_PRESENT.to_numpy()[:, np.newaxis]
onehot = enc.fit_transform(bu_class)
X_test['BUILDING_CLASS_AT_PRESENT'] = onehot


In [205]:
# Encode ADDRESS
address = X_train.ADDRESS.to_numpy()[:, np.newaxis]
onehot = enc.fit_transform(address)
X_train['ADDRESS'] = onehot

address = X_test.ADDRESS.to_numpy()[:, np.newaxis]
onehot = enc.fit_transform(address)
X_test['ADDRESS'] = onehot


In [206]:
# Encode BLOCK
block = X_train.BLOCK.to_numpy()[:, np.newaxis]
onehot = enc.fit_transform(block)
X_train['BLOCK'] = onehot

block = X_test.BLOCK.to_numpy()[:, np.newaxis]
onehot = enc.fit_transform(block)
X_test['BLOCK'] = onehot


In [207]:
# Encode BUILDING_CLASS_AT_TIME_OF_SALE
bcs = X_train.BUILDING_CLASS_AT_TIME_OF_SALE.to_numpy()[:, np.newaxis]
onehot = enc.fit_transform(bcs)
X_train['BUILDING_CLASS_AT_TIME_OF_SALE'] = onehot

bcs = X_test.BUILDING_CLASS_AT_TIME_OF_SALE.to_numpy()[:, np.newaxis]
onehot = enc.fit_transform(bcs)
X_test['BUILDING_CLASS_AT_TIME_OF_SALE'] = onehot


In [208]:
# See how the training data looks
X_train.head()

SALE_DATE,BOROUGH,NEIGHBORHOOD,BUILDING_CLASS_CATEGORY,TAX_CLASS_AT_PRESENT,BLOCK,LOT,BUILDING_CLASS_AT_PRESENT,ADDRESS,ZIP_CODE,RESIDENTIAL_UNITS,COMMERCIAL_UNITS,TOTAL_UNITS,LAND_SQUARE_FEET,GROSS_SQUARE_FEET,YEAR_BUILT,TAX_CLASS_AT_TIME_OF_SALE,BUILDING_CLASS_AT_TIME_OF_SALE
2019-03-20,5,0.0,1.0,1.0,0.0,61,0.0,0.0,10312.0,1.0,0.0,1.0,4000,1848.0,1975.0,1,0.0
2019-03-29,2,0.0,1.0,1.0,0.0,4,0.0,0.0,10459.0,1.0,0.0,1.0,1890,1152.0,1993.0,1,0.0
2019-01-08,4,0.0,1.0,1.0,0.0,15,0.0,0.0,11373.0,1.0,0.0,1.0,2142,1860.0,1940.0,1,0.0
2019-02-21,4,0.0,1.0,1.0,0.0,50,0.0,0.0,11367.0,1.0,0.0,1.0,1800,1296.0,1940.0,1,0.0
2019-01-07,4,0.0,1.0,1.0,0.0,7,0.0,0.0,11427.0,1.0,0.0,1.0,4000,1653.0,1940.0,1,0.0
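
In the table above, each encoded column holds a single number per row rather than one indicator per category. One way to keep every indicator column is to fit the encoder on the training data and concatenate its output back onto the frame. The sketch below is only an illustration under assumptions: the toy frames and the `one_hot` helper are hypothetical, and `handle_unknown='ignore'` is a choice made here so that categories unseen in training do not raise an error.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy frames standing in for X_train and X_test.
toy_train = pd.DataFrame({'NEIGHBORHOOD': ['OTHER', 'FLUSHING-NORTH', 'OTHER', 'BAYSIDE'],
                          'GROSS_SQUARE_FEET': [1200, 1500, 900, 2000]})
toy_test = pd.DataFrame({'NEIGHBORHOOD': ['BAYSIDE', 'UNSEEN AREA'],
                         'GROSS_SQUARE_FEET': [1100, 1300]})

def one_hot(frame, encoder, column):
    """Replace `column` with one indicator column per category known to `encoder`."""
    dense = encoder.transform(frame[[column]]).toarray()
    names = [f'{column}_{cat}' for cat in encoder.categories_[0]]
    indicators = pd.DataFrame(dense, columns=names, index=frame.index)
    return pd.concat([frame.drop(columns=column), indicators], axis=1)

# Fit on the training data only; unseen test categories become all-zero rows.
enc_toy = OneHotEncoder(handle_unknown='ignore')
enc_toy.fit(toy_train[['NEIGHBORHOOD']])

train_encoded = one_hot(toy_train, enc_toy, 'NEIGHBORHOOD')
test_encoded = one_hot(toy_test, enc_toy, 'NEIGHBORHOOD')
print(train_encoded.columns.tolist())
```

Because the encoder is fitted once on the training frame, train and test end up with the same columns in the same order.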


In [209]:
# Remove commas (thousands separators)
X_train['LAND_SQUARE_FEET'] = X_train['LAND_SQUARE_FEET'].str.replace(',', '')
X_test['LAND_SQUARE_FEET'] = X_test['LAND_SQUARE_FEET'].str.replace(',', '')



In [210]:
# Convert to ints
X_train['LAND_SQUARE_FEET'] = X_train['LAND_SQUARE_FEET'].astype(int)
X_test['LAND_SQUARE_FEET'] = X_test['LAND_SQUARE_FEET'].astype(int)



## Feature Selection with SelectKBest

In [211]:
from sklearn.feature_selection import SelectKBest, f_regression

selector = SelectKBest(score_func=f_regression, k=5)

X_train_selected = selector.fit_transform(X_train, y_train)

selected_mask = selector.get_support()
all_features = X_train.columns
selected_feature = all_features[selected_mask]

print("The selected features: ", selected_feature)

The selected features:  Index(['TAX_CLASS_AT_PRESENT', 'ZIP_CODE', 'TOTAL_UNITS', 'LAND_SQUARE_FEET',
       'GROSS_SQUARE_FEET'],
      dtype='object')
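
The selector above is fitted on the training features only. To use the selection downstream, the same fitted selector would normally be applied to the test features with `transform`. A minimal sketch on hypothetical synthetic data, where only columns 0 and 3 drive the target:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

# Hypothetical synthetic data: only columns 0 and 3 drive the target.
rng = np.random.default_rng(42)
X_train_toy = rng.normal(size=(200, 6))
X_test_toy = rng.normal(size=(50, 6))
y_train_toy = (3.0 * X_train_toy[:, 0] + 2.0 * X_train_toy[:, 3]
               + rng.normal(scale=0.1, size=200))

selector_toy = SelectKBest(score_func=f_regression, k=2)
X_train_selected_toy = selector_toy.fit_transform(X_train_toy, y_train_toy)  # scores from train
X_test_selected_toy = selector_toy.transform(X_test_toy)                     # same columns for test

selected_columns = np.flatnonzero(selector_toy.get_support())
print("Selected column indices:", selected_columns)
```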



## Ridge Regression

In [212]:
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
ridge_model = make_pipeline(PolynomialFeatures(5),
                           Ridge(normalize=True, alpha=0.05))


In [213]:
# Choose which features to use
features = ['ZIP_CODE', 'COMMERCIAL_UNITS', 'TOTAL_UNITS', 'LAND_SQUARE_FEET',
            'GROSS_SQUARE_FEET']
X_train = X_train[features]
X_test = X_test[features]

In [214]:
# Fit model
ridge_model.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('polynomialfeatures',
                 PolynomialFeatures(degree=5, include_bias=True,
                                    interaction_only=False, order='C')),
                ('ridge',
                 Ridge(alpha=0.05, copy_X=True, fit_intercept=True,
                       max_iter=None, normalize=True, random_state=None,
                       solver='auto', tol=0.001))],
         verbose=False)
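
`PolynomialFeatures(5)` in the pipeline above expands the inputs into every monomial of total degree 5 or less (with `include_bias=True` and `interaction_only=False`, as the printed parameters show). The number of generated columns grows quickly with the number of input features, following the binomial coefficient C(n + d, d). A quick sketch of that count; the helper name is hypothetical:

```python
from math import comb

def n_polynomial_terms(n_features, degree):
    """Number of monomials of total degree <= `degree`, bias term included."""
    return comb(n_features + degree, degree)

# 5 is the number of features used here; 17 is the full column count of X_train.
for n_features in (5, 10, 17):
    print(n_features, "features ->", n_polynomial_terms(n_features, 5), "columns")
```

With the 5 features used here that comes to 252 columns; with all 17 columns of `X_train` it would be 26,334.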

In [215]:
# Predict
yfit_ridge = ridge_model.predict(X_test)

In [216]:
# Compute the MAE on the test set
from sklearn import metrics
print("Mean Absolute Error: ", metrics.mean_absolute_error(y_test, yfit_ridge))

Mean Absolute Error:  189671.17540799882


In [217]:
# Make the baseline: predict the mean sale price for every test row
baseline = [y_test.mean()] * len(y_test)

In [218]:
# Compute the baseline MAE
print("Mean Absolute Error: ", metrics.mean_absolute_error(y_test, baseline))

Mean Absolute Error:  230105.43367918688
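
The baseline above takes its mean from `y_test` itself. A stricter variant would predict the training-set mean, since the test prices would not be known at prediction time. The sketch below shows that variant on hypothetical toy prices (the helper name is an assumption), then computes the relative MAE reduction implied by the two values printed above.

```python
import numpy as np

def mean_baseline_mae(y_train, y_test):
    """MAE of always predicting the training-set mean."""
    prediction = np.mean(y_train)
    return float(np.mean(np.abs(np.asarray(y_test) - prediction)))

# Hypothetical toy prices, only to show the call.
toy_train_prices = [400_000, 520_000, 310_000, 770_000]
toy_test_prices = [450_000, 650_000]
toy_mae = mean_baseline_mae(toy_train_prices, toy_test_prices)
print("Toy baseline MAE:", toy_mae)

# Relative reduction implied by the two MAE values printed above.
ridge_mae = 189671.17540799882
baseline_mae = 230105.43367918688
relative_reduction = 1 - ridge_mae / baseline_mae
print("Relative reduction:", round(relative_reduction, 3))
```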


## Conclusion

The ridge model's test MAE (about 189,700) is roughly 18% lower than the mean baseline's (about 230,100), so the model beats the baseline, although a typical prediction is still off by a large amount. More features would probably improve it further, but Google Colab crashes if I try to add any more of them.