Lambda School Data Science

*Unit 2, Sprint 1, Module 3*

---

# Ridge Regression

## Assignment

We're going back to our other **New York City** real estate dataset. Instead of predicting apartment rents, you'll predict property sales prices.

But not just for condos in Tribeca...

- [ ] Use a subset of the data where `BUILDING_CLASS_CATEGORY` == `'01 ONE FAMILY DWELLINGS'` and the sale price was more than 100 thousand and less than 2 million.
- [ ] Do train/test split. Use data from January — March 2019 to train. Use data from April 2019 to test.
- [ ] Do one-hot encoding of categorical features.
- [ ] Do feature selection with `SelectKBest`.
- [ ] Fit a ridge regression model with multiple features. Use the `normalize=True` parameter (or do [feature scaling](https://scikit-learn.org/stable/modules/preprocessing.html) beforehand — use the scaler's `fit_transform` method with the train set, and the scaler's `transform` method with the test set)
- [ ] Get mean absolute error for the test set.
- [ ] As always, commit your notebook to your fork of the GitHub repo.
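
The feature-scaling option in the checklist above can be sketched as follows. This is a minimal illustration, assuming `StandardScaler` as the scaler and small synthetic arrays in place of the real sales data; the array names and coefficients are hypothetical.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical, NOT the NYC sales data):
# three features on very different scales.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 3)) * [1, 100, 10_000]
X_test = rng.normal(size=(50, 3)) * [1, 100, 10_000]
coef = np.array([5.0, 0.2, 0.001])
y_train = X_train @ coef + rng.normal(scale=0.1, size=200)
y_test = X_test @ coef + rng.normal(scale=0.1, size=50)

# Learn the mean and standard deviation from the train set only...
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
# ...and reuse those same statistics on the test set.
X_test_scaled = scaler.transform(X_test)

model = Ridge(alpha=1.0)
model.fit(X_train_scaled, y_train)
mae = mean_absolute_error(y_test, model.predict(X_test_scaled))
print(f'Test MAE: {mae:.3f}')
```

Fitting the scaler on the train set alone keeps information about the test set out of the model.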

The [NYC Department of Finance](https://www1.nyc.gov/site/finance/taxes/property-rolling-sales-data.page) has a glossary of property sales terms and NYC Building Class Code Descriptions. The data comes from the [NYC OpenData](https://data.cityofnewyork.us/browse?q=NYC%20calendar%20sales) portal.


## Stretch Goals

Don't worry, you aren't expected to do all these stretch goals! These are just ideas to consider and choose from.

- [ ] Add your own stretch goal(s)!
- [ ] Instead of `Ridge`, try `LinearRegression`. Depending on how many features you select, your errors will probably blow up! 💥
- [ ] Instead of `Ridge`, try [`RidgeCV`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeCV.html).
- [ ] Learn more about feature selection:
    - ["Permutation importance"](https://www.kaggle.com/dansbecker/permutation-importance)
    - [scikit-learn's User Guide for Feature Selection](https://scikit-learn.org/stable/modules/feature_selection.html)
    - [mlxtend](http://rasbt.github.io/mlxtend/) library
    - scikit-learn-contrib libraries: [boruta_py](https://github.com/scikit-learn-contrib/boruta_py) & [stability-selection](https://github.com/scikit-learn-contrib/stability-selection)
    - [_Feature Engineering and Selection_](http://www.feat.engineering/) by Kuhn & Johnson.
- [ ] Try [statsmodels](https://www.statsmodels.org/stable/index.html) if you’re interested in a more inferential statistical approach to linear regression and feature selection, looking at p-values and 95% confidence intervals for the coefficients.
- [ ] Read [_An Introduction to Statistical Learning_](http://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf), Chapters 1-3, for more math & theory, but in an accessible, readable way.
- [ ] Try [scikit-learn pipelines](https://scikit-learn.org/stable/modules/compose.html).
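
As a starting point for the pipeline and `RidgeCV` stretch goals, here is a minimal sketch that chains scaling, `SelectKBest`, and `RidgeCV` into one estimator. The data is synthetic, and the choice of `k=2` and the alpha grid are assumptions for illustration only.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (hypothetical): only the first two features matter.
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=300)
X_train, X_test = X[:200], X[200:]
y_train, y_test = y[:200], y[200:]

# Each step is fit on the train data only when pipeline.fit is called.
pipeline = make_pipeline(
    StandardScaler(),
    SelectKBest(score_func=f_regression, k=2),
    RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0]),
)
pipeline.fit(X_train, y_train)

# make_pipeline names each step after its lowercased class name.
print('Chosen alpha:', pipeline.named_steps['ridgecv'].alpha_)
print('Test R^2:', pipeline.score(X_test, y_test))
```

Calling `pipeline.predict` or `pipeline.score` on the test set applies the already-fitted scaler and selector before the ridge model, so the train/test handling stays consistent.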

In [2]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Applied-Modeling/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'
    
# Ignore this Numpy warning when using Plotly Express:
# FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning, module='numpy')

In [3]:
import pandas as pd
import pandas_profiling

# Read New York City property sales data
df = pd.read_csv(DATA_PATH+'condos/NYC_Citywide_Rolling_Calendar_Sales.csv')

# Change column names: replace spaces with underscores
df.columns = [col.replace(' ', '_') for col in df]

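# SALE_PRICE was read as strings.
# Remove symbols, convert to integer.
# regex=False treats '$' as a literal character, not a regex anchor.
df['SALE_PRICE'] = (
    df['SALE_PRICE']
    .str.replace('$', '', regex=False)
    .str.replace('-', '', regex=False)
    .str.replace(',', '', regex=False)
    .astype(int)
)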
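
The price-cleaning step can be tried in isolation on a few hand-written strings. This is a small sketch with made-up values (not rows from the dataset). Because `$` is a regex metacharacter, passing `regex=False` states the literal intent explicitly and behaves the same across pandas versions.

```python
import pandas as pd

# Hypothetical price strings, for illustration only
prices = pd.Series(['$1,250,000', '$0', '$450,000'])

cleaned = (
    prices
    .str.replace('$', '', regex=False)  # literal dollar sign
    .str.replace(',', '', regex=False)  # thousands separators
    .astype(int)
)
print(cleaned.tolist())
```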

In [4]:
# BOROUGH is a numeric column, but arguably should be a categorical feature,
# so convert it from a number to a string
df['BOROUGH'] = df['BOROUGH'].astype(str)

In [5]:
# Reduce cardinality for NEIGHBORHOOD feature

# Get a list of the top 10 neighborhoods
top10 = df['NEIGHBORHOOD'].value_counts()[:10].index

# At locations where the neighborhood is NOT in the top 10, 
# replace the neighborhood with 'OTHER'
df.loc[~df['NEIGHBORHOOD'].isin(top10), 'NEIGHBORHOOD'] = 'OTHER'

In [49]:
df.head(1)
df.shape

(3151, 21)

In [13]:
#filter df to only include single family dwellings
df = df[df['BUILDING_CLASS_CATEGORY'] == '01 ONE FAMILY DWELLINGS']

In [18]:
#change column titles to lower case 
df.columns = df.columns.str.lower()

In [28]:
#filter df such that 100,000 < sale_price < 2,000,000
df = df[(df['sale_price'] > 100000) & (df['sale_price'] < 2000000)]

In [31]:
#convert 'sale_date' to datetime format
df['sale_date'] = pd.to_datetime(df['sale_date'])

In [56]:
#drop 'apartment_number': all NaN except one row, so it carries no useful information
df = df.drop('apartment_number', axis=1)

In [52]:
#drop 'ease-ment' column (all NaN's) 
df = df.drop('ease-ment', axis=1)

In [85]:
#reset index; without drop=True the old index is kept as an 'index' column
df = df.reset_index()

In [124]:
df.describe(exclude='number').T

feature,count,unique,top,freq,first,last
borough,3151,5,4,1580,NaT,NaT
neighborhood,3151,6,OTHER,2990,NaT,NaT
tax_class_at_present,3151,2,1,3111,NaT,NaT
building_class_at_present,3151,13,A1,1185,NaT,NaT
address,3151,3135,108-16 171ST PLACE,2,NaT,NaT
building_class_at_time_of_sale,3151,11,A1,1186,NaT,NaT
sale_date,3151,91,2019-01-31 00:00:00,78,2019-01-01,2019-04-30


In [119]:
#convert 'land_square_feet' to numeric

df['land_square_feet'] = pd.to_numeric(df['land_square_feet'].str.replace(',', '', regex=False))

In [123]:
#drop 'building_class_category' (constant after filtering to one-family dwellings)

df = df.drop('building_class_category', axis=1)

In [125]:
#create train (Jan - March 2019) and test (April 2019)  df
train = df[(df['sale_date'] >= pd.Timestamp(2019,1,1)) & (df['sale_date'] <= pd.Timestamp(2019,3,31))]
test = df[(df['sale_date'] >= pd.Timestamp(2019,4,1)) & (df['sale_date'] <= pd.Timestamp(2019,4,30))]

In [126]:
#set up categorical features for OneHot Encoding
target = ['sale_price']
high_cardinality = ['address', 'sale_date']
features = train.columns.drop(high_cardinality + target)

In [130]:
#setup train and test dfs
X_train = train[features]
X_test = test[features]
y_train = train[target]
y_test = test[target]

In [131]:
#OneHotEncode the categorical columns
import category_encoders as ce

encoder = ce.OneHotEncoder(use_cat_names=True)
X_train = encoder.fit_transform(X_train)
X_test = encoder.transform(X_test)

In [146]:
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

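# Initial selector; the loop below re-creates it for every k
selector = SelectKBest(score_func=f_regression, k=15)

# Silence library warnings for cleaner output
import warnings
warnings.filterwarnings("ignore")

X_train_selected = selector.fit_transform(X_train, y_train)
X_test_selected = selector.transform(X_test)

# Try every possible number of features and report the test error for each
for k in range(1, len(X_train.columns)+1):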
    print(f'{k} features')
    
    selector = SelectKBest(score_func=f_regression, k=k)
    X_train_selected = selector.fit_transform(X_train, y_train)
    X_test_selected = selector.transform(X_test)


    model = LinearRegression()
    model.fit(X_train_selected, y_train)
    y_pred = model.predict(X_test_selected)
    mae = mean_absolute_error(y_test, y_pred)
    print(f'Test Mean Absolute Error: ${mae:,.0f} \n')

1 features
Test Mean Absolute Error: $183,641 

2 features
Test Mean Absolute Error: $179,555 

3 features
Test Mean Absolute Error: $179,291 

4 features
Test Mean Absolute Error: $179,291 

5 features
Test Mean Absolute Error: $170,483 

6 features
Test Mean Absolute Error: $169,982 

7 features
Test Mean Absolute Error: $168,140 

8 features
Test Mean Absolute Error: $168,245 

9 features
Test Mean Absolute Error: $167,855 

10 features
Test Mean Absolute Error: $164,737 

11 features
Test Mean Absolute Error: $165,346 

12 features
Test Mean Absolute Error: $164,860 

13 features
Test Mean Absolute Error: $155,159 

14 features
Test Mean Absolute Error: $156,541 

15 features
Test Mean Absolute Error: $156,572 

16 features
Test Mean Absolute Error: $156,573 

17 features
Test Mean Absolute Error: $156,394 

18 features
Test Mean Absolute Error: $156,394 

19 features
Test Mean Absolute Error: $156,255 

20 features
Test Mean Absolute Error: $156,255 

21 features
Test Mean Absolut

Based on the loop output above, 13 features gives the lowest test MAE among the results shown ($155,159). Adding more features raises the error slightly, so k=13 offers the best accuracy for the least complexity.
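
The loop above compares values of `k` by their test-set error. A common alternative is to choose `k` by cross-validation on the train set, so the test set is used only once for the final score. The sketch below shows that idea with `GridSearchCV`, under stated assumptions: the data is synthetic, and the step names, `alpha=1.0`, and the 5-fold setting are illustrative choices rather than part of the assignment.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic train data (hypothetical): features 0 and 3 carry the signal.
rng = np.random.default_rng(1)
X_train = rng.normal(size=(200, 8))
y_train = 4 * X_train[:, 0] + 2 * X_train[:, 3] + rng.normal(size=200)

pipe = Pipeline([
    ('scale', StandardScaler()),
    ('select', SelectKBest(score_func=f_regression)),
    ('ridge', Ridge(alpha=1.0)),
])

# Search over every possible k, scoring by (negated) mean absolute error.
search = GridSearchCV(
    pipe,
    param_grid={'select__k': list(range(1, X_train.shape[1] + 1))},
    scoring='neg_mean_absolute_error',
    cv=5,
)
search.fit(X_train, y_train)

best_k = search.best_params_['select__k']
print(f'Best k: {best_k}, CV MAE: {-search.best_score_:.3f}')
```

Because selection happens inside the pipeline, each cross-validation fold re-runs `SelectKBest` on its own training portion.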

In [145]:
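#import RidgeCV
from sklearn.linear_model import RidgeCV
#find the ideal alpha for the model
#(normalize=True follows the assignment; the parameter was removed in scikit-learn 1.2)
alphas = [.001, .002, .003, .004]
ridge = RidgeCV(alphas=alphas, normalize=True)
#X_train_selected and X_test_selected hold the last selection made in the loop above
ridge.fit(X_train_selected, y_train)
print(f'Alpha = {ridge.alpha_}')
#determine Mean Absolute Error with the fitted ridge model
y_pred = ridge.predict(X_test_selected)
mae = mean_absolute_error(y_test, y_pred)
print(f'Mean Absolute Error = ${mae:,.0f}')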

Alpha = 0.002
Mean Absolute Error = $156,469


- [x] Use a subset of the data where `BUILDING_CLASS_CATEGORY` == `'01 ONE FAMILY DWELLINGS'` and the sale price was more than 100 thousand and less than 2 million.
- [x] Do train/test split. Use data from January — March 2019 to train. Use data from April 2019 to test.
- [x] Do one-hot encoding of categorical features.
- [x] Do feature selection with `SelectKBest`.
- [ ] Fit a ridge regression model with multiple features. Use the `normalize=True` parameter (or do [feature scaling](https://scikit-learn.org/stable/modules/preprocessing.html) beforehand — use the scaler's `fit_transform` method with the train set, and the scaler's `transform` method with the test set)
- [ ] Get mean absolute error for the test set.
- [ ] As always, commit your notebook to your fork of the GitHub repo.